Conference: International Conference on Language Resources and Evaluation

Know thy corpus! Robust methods for digital curation of Web corpora

Abstract

This paper proposes a novel framework for digital curation of Web corpora that provides robust estimates of their parameters, such as their composition and lexicon. In recent years, language models pre-trained on large corpora have emerged as clear winners in numerous NLP tasks, but the corpora behind their success have not been properly analysed. The paper presents a procedure for robust frequency estimation, which helps in establishing the core lexicon of a given corpus, as well as a procedure for estimating corpus composition via unsupervised topic models and via supervised genre classification of Web pages. Applying the digital curation study to several Web-derived corpora shows that they differ considerably. First, they differ in their frequency bursts, which affect the core lexicon obtained from each corpus. Second, they differ in the kinds of texts they contain. For example, OpenWebText contains considerably more topical news and political argumentation than ukWaC or Wikipedia. The tools and the results of the analysis have been released.
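To illustrate why frequency bursts matter for a core lexicon, here is a minimal sketch that compares a pooled frequency count with a burst-resistant one. It is not the estimator used in the paper, which the abstract does not specify. The winsorised mean, the function names (`pooled_rate`, `per_doc_rates`, `winsorised_mean`), the 0.9 quantile, and the toy documents are all assumptions made for illustration.

```python
def pooled_rate(docs, word):
    """Occurrences of `word` per million tokens over the whole corpus."""
    total_tokens = sum(len(doc) for doc in docs)
    hits = sum(doc.count(word) for doc in docs)
    return 1e6 * hits / total_tokens


def per_doc_rates(docs, word):
    """Occurrences of `word` per million tokens, computed per document."""
    return [1e6 * doc.count(word) / len(doc) for doc in docs if doc]


def winsorised_mean(values, upper_quantile=0.9):
    """Mean of `values` after clipping everything above the chosen quantile.

    Assumption: clipping the top tail is one simple way to damp bursts,
    i.e. words that are very frequent in only a few documents.
    """
    ordered = sorted(values)
    cap = ordered[int(upper_quantile * (len(ordered) - 1))]
    return sum(min(v, cap) for v in values) / len(values)


# Toy corpus (hypothetical): "brexit" bursts in a single document,
# while "the" is spread evenly over all of them.
docs = [["the", "cat", "sat", "here"] for _ in range(9)]
docs.append(["brexit", "brexit", "brexit", "the"])

for word in ("the", "brexit"):
    raw = pooled_rate(docs, word)
    robust = winsorised_mean(per_doc_rates(docs, word))
    print(f"{word}: pooled={raw:.0f} robust={robust:.0f}")
```

In this toy corpus the pooled count gives "brexit" a sizeable frequency although it occurs in only one document, whereas the clipped per-document mean leaves the evenly spread word "the" unchanged and pushes the bursty word down. This is the kind of effect a robust estimate is meant to control when selecting a core lexicon.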
