
Building a 70 billion word corpus of English from ClueWeb


Abstract

This work describes the process of creating a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with the pre-processing (boilerplate cleaning, text de-duplication) and post-processing (indexing for efficient corpus querying using CQL, the Corpus Query Language) steps. In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also at the level of near-duplicate texts. Moreover, we show the impact of each of the performed pre-processing steps on the final corpus size. Furthermore, we show how effective parallelization of the corpus indexing procedure was employed within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word’s grammatical and collocational behaviour) from the resulting corpus.
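The near-duplicate removal the abstract describes can be illustrated with a minimal sketch of shingle-based de-duplication: each document is reduced to a set of hashed word n-grams, and a document is discarded when too many of its shingles were already seen in previously kept documents. This is an assumption-laden simplification of the greedy strategy the onion tool applies (the function names, the shingle size `n=5`, and the `threshold=0.5` cut-off here are illustrative choices, not onion's actual parameters).

```python
import hashlib


def shingles(text, n=5):
    """Split a text into overlapping word n-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def dedup(documents, threshold=0.5, n=5):
    """Greedily keep each document whose shingle overlap with all
    previously kept documents stays below `threshold` (a simplified,
    illustrative version of onion-style near-duplicate removal)."""
    seen = set()   # hashes of shingles from every kept document
    kept = []
    for doc in documents:
        sh = {hashlib.md5(s.encode()).hexdigest() for s in shingles(doc, n)}
        if not sh:
            continue
        overlap = len(sh & seen) / len(sh)
        if overlap < threshold:
            kept.append(doc)
            seen |= sh
    return kept
```

A document that repeats most of an earlier one (e.g. the same sentence with one word appended) shares nearly all of its shingles with the kept set and is therefore dropped, while a genuinely new text passes through; hashing the shingles keeps the `seen` set compact, which matters at ClueWeb scale.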

