Know thy corpus! Robust methods for digital curation of Web corpora

Abstract

This paper proposes a novel framework for digital curation of Web corpora in order to provide robust estimation of their parameters, such as their composition and their lexicon. In recent years, language models pre-trained on large corpora have emerged as clear winners in numerous NLP tasks, but no proper analysis of the corpora that led to their success has been conducted. The paper presents a procedure for robust frequency estimation, which helps in establishing the core lexicon for a given corpus, as well as a procedure for estimating the corpus composition via unsupervised topic models and via supervised genre classification of Web pages. The results of the digital curation study applied to several Web-derived corpora demonstrate considerable differences between them. First, the corpora differ in their frequency bursts, which affect the core lexicon obtained from each corpus. Second, they differ in the kinds of texts they contain. For example, OpenWebText contains considerably more topical news and political argumentation than ukWac or Wikipedia. The tools and the results of the analysis have been released.
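
The abstract does not spell out the estimation procedure, so the following is only an illustrative sketch of the general idea of burst-resistant frequency counting, not the paper's method. It assumes a corpus given as a list of tokenised documents and limits how much any single document can contribute to a word's count. The function names and the `cap` and `ratio` parameters are hypothetical.

```python
from collections import Counter


def pooled_counts(docs):
    """Plain corpus frequency: every occurrence counts, so one bursty
    document can push a word to the top of the list."""
    total = Counter()
    for tokens in docs:
        total.update(tokens)
    return total


def capped_counts(docs, cap=3):
    """Burst-resistant frequency (an assumed, simplified scheme):
    each document contributes at most `cap` occurrences of a word."""
    total = Counter()
    for tokens in docs:
        for word, n in Counter(tokens).items():
            total[word] += min(n, cap)
    return total


def bursty_words(docs, cap=3, ratio=2.0):
    """Words whose pooled count exceeds the capped count by `ratio`,
    i.e. words whose frequency is concentrated in few documents."""
    pooled = pooled_counts(docs)
    capped = capped_counts(docs, cap)
    return sorted(w for w in pooled if pooled[w] / capped[w] >= ratio)


# Toy corpus: the last "document" repeats one word fifty times.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "bird", "flew"],
    ["spam"] * 50 + ["the"],
]

print(pooled_counts(docs).most_common(1))  # top word by raw count
print(capped_counts(docs).most_common(1))  # top word after capping
print(bursty_words(docs))                  # words flagged as bursty
```

In this toy corpus the raw count ranks the repeated word first, while the capped count ranks the word that is spread evenly across documents first, which is the kind of effect a core-lexicon estimate needs to be protected against.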

Bibliographic record

Source
International Conference on Language Resources and Evaluation | 2020 | pp. 2453-2460 | 8 pages
Author
Serge Sharoff
Original format: PDF
Keywords
Validation of language resources; Text analytics; Language Modelling; Digital curation

Similar literature

1. BAHASA INDONESIA TEXT CORPUS GENERATION USING WEB CORPORA APPROACHES [J]. AMALIA AMALIA, OPIM SALIM SITOMPUL, ERNA BUDHIARTI NABABAN. Journal of Theoretical and Applied Information Technology, 2019, No. 24.
2. Historical corpora meet the digital humanities: the Jerusalem Corpus of Emergent Modern Hebrew [J]. Rubinstein Aynat. Language Resources and Evaluation, 2019, No. 4.
3. An Intelligent Web Digital Image Metadata Service Platform for Social Curation Commerce Environment [J]. Seong-Yong Hong, Sung-Joon Lee. Modelling and Simulation in Engineering, 2015, No. 1.
4. Integrating large-scale web data and curated corpus data in a search engine supporting German literacy education [C]. Sabrina Dittrich, Zarah Weiss, Hannes Schroeter. Workshop on Natural Language Processing for Computer Assisted Language Learning, 2019.
5. The Digital Curation of Broadcasting Archives at the Canadian Broadcasting Corporation: Curation Culture and Evaluative Practice [D]. Ivanov, Asen O. 2019.
6. Identifying e-books authored by faculty: a method for scoping the digital collection and curating a list [O]. Sonali Sugrim, Laura Schimming, Gali Halevi. 2019.
7. Utilité du partage des corpus pour l'analyse des interactions en ligne en situation d'apprentissage : un exemple d'approche méthodologique autour d'une base de corpus d'apprentissage / Benefits of Sharing Corpora when Analyzing Online Interactions: an Example of Methodology Related to a Databank of Learning and Teaching Corpora [O]. Maud Ciekanski, Thierry Chanier. 2010.
