Rapid creation of large-scale corpora and frequency dictionaries


Abstract

We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium-density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Spanish, and Swedish. To make the process uniform across languages, we selected tools that are either language-independent or easily customizable for each language, and reimplemented all stages that were taking too long. To achieve processing times that are insignificant compared to the time data collection (crawling) takes, we reimplemented the standard sentence- and word-level tokenizers and created new boilerplate and near-duplicate detection algorithms. Preliminary experiments with non-European languages indicate that our methods are now applicable not just to our sample but to the entire population of digitally viable languages, with the main limiting factor being the availability of high-quality stemmers.
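The abstract names three processing stages (word-level tokenization, near-duplicate detection, and frequency counting) without giving their details. The following is a minimal sketch of how such stages could fit together, and is not the authors' toolchain: the regex tokenizer, the word-shingle Jaccard comparison, the 0.8 threshold, and all function names are assumptions chosen for illustration. The pairwise comparison is quadratic in the number of documents, so it would not reach the processing speeds the abstract describes.

```python
import re
from collections import Counter

# Assumed stand-in tokenizer: lowercase runs of word characters.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation."""
    return TOKEN_RE.findall(text.lower())


def shingles(tokens, k=3):
    """Return the set of k-token shingles; short inputs yield one shingle."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep a document only if it is not too similar to any kept one."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(tokenize(doc), k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def frequency_dictionary(docs):
    """Count token occurrences across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


docs = [
    "The cat sat on the mat near the door.",
    "The cat sat on the mat near the door!",  # differs only in punctuation
    "A dog barked at the mail carrier.",
]
unique_docs = drop_near_duplicates(docs)
freq = frequency_dictionary(unique_docs)
print(freq.most_common(3))
```

Filtering near-duplicates before counting keeps repeated pages from inflating the counts in the resulting frequency dictionary.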


Bibliographic details

Source: LREC-2012, 2012, pp. 1462-1465 (4 pages)
Authors: Attila Zseder; Gabor Recski; Daniel Varga; Andras Kornai
Original format: PDF
Keywords: Web corpus; frequency dictionary; hun* tools


