
Building a 70 billion word corpus of English from ClueWeb


Abstract

This work describes the process of creating a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with the pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using CQL, the Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also at the level of near-duplicate texts. Moreover, we show the impact of each of the performed pre-processing steps on the final corpus size. Furthermore, we show how effective parallelization of the corpus indexing procedure was employed within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
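The near-duplicate removal the abstract describes can be illustrated with a minimal sketch of shingle-based de-duplication: each document is reduced to a set of hashed word n-grams, and a document is discarded when too many of its shingles were already seen in previously kept documents. This is an assumption-laden simplification of the greedy strategy the onion tool applies (the function names, the shingle size `n=5`, and the `threshold=0.5` cut-off here are illustrative choices, not onion's actual parameters).

```python
import hashlib


def shingles(text, n=5):
    """Split a text into overlapping word n-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def dedup(documents, threshold=0.5, n=5):
    """Greedily keep each document whose shingle overlap with all
    previously kept documents stays below `threshold` (a simplified,
    illustrative version of onion-style near-duplicate removal)."""
    seen = set()   # hashes of shingles from every kept document
    kept = []
    for doc in documents:
        sh = {hashlib.md5(s.encode()).hexdigest() for s in shingles(doc, n)}
        if not sh:
            continue
        overlap = len(sh & seen) / len(sh)
        if overlap < threshold:
            kept.append(doc)
            seen |= sh
    return kept
```

A document that repeats most of an earlier one (e.g. the same sentence with one word appended) shares nearly all of its shingles with the kept set and is therefore dropped, while a genuinely new text passes through; hashing the shingles keeps the `seen` set compact, which matters at ClueWeb scale.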

