What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

Abstract

Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures. We discuss the potential impacts of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.


Bibliographic details

Source
Annual Meeting of the Association for Computational Linguistics; International Joint Conference on Natural Language Processing | 2021 | pp. 182-189 | 8 pages
Authors
Alexandra (Sasha) Luccioni; Joseph D. Viviano
Original format: PDF

