
Rapid creation of large-scale corpora and frequency dictionaries

Abstract

We describe, and make public, large-scale language resources and the toolchain used in their creation, for fifteen medium-density European languages: Catalan, Czech, Croatian, Danish, Dutch, Finnish, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Serbian, Slovak, Spanish, and Swedish. To make the process uniform across languages, we selected tools that are either language-independent or easily customizable for each language, and reimplemented all stages that were taking too long. To achieve processing times that are insignificant compared to the time that data collection (crawling) takes, we reimplemented the standard sentence- and word-level tokenizers and created new boilerplate and near-duplicate detection algorithms. Preliminary experiments with non-European languages indicate that our methods are now applicable not just to our sample, but to the entire population of digitally viable languages, with the main limiting factor being the availability of high-quality stemmers.
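
The abstract names the pipeline stages (tokenization, near-duplicate detection, frequency counting) but does not describe the algorithms themselves. As a rough illustration of how such stages fit together, the following is a minimal sketch that assumes a naive regex tokenizer and a textbook shingle-plus-Jaccard duplicate test. It is not the authors' toolchain; every function name, the shingle size `n`, and the `threshold` value are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Assumption: a run of Unicode word characters counts as one token.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, language-independent)."""
    return TOKEN_RE.findall(text.lower())


def shingles(tokens, n=3):
    """Return the set of n-token shingles; a short document yields a single shingle."""
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_and_count(docs, threshold=0.8, n=3):
    """Drop near-duplicate documents, then build a frequency dictionary from the rest.

    A document is skipped when its shingle set overlaps an already kept
    document by at least `threshold`. The pairwise comparison is quadratic,
    so this only illustrates the idea and would not scale to a web crawl.
    """
    kept, kept_shingles = [], []
    freq = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        sh = shingles(tokens, n)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of a document we already kept
        kept.append(doc)
        kept_shingles.append(sh)
        freq.update(tokens)
    return kept, freq


if __name__ == "__main__":
    sample = [
        "The cat sat on the mat today.",
        "The cat sat on the mat today!",
        "A completely different sentence here.",
    ]
    kept_docs, frequencies = dedup_and_count(sample)
    print(len(kept_docs), frequencies.most_common(3))
```

At crawl scale, the exhaustive pairwise comparison would normally be replaced by a sketching or hashing scheme; which approach the paper actually uses is not stated in the abstract.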