Annual Meeting of the Association for Computational Linguistics; International Joint Conference on Natural Language Processing

What's in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus

Abstract

Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures. We discuss the potential impacts of this content on language models and conclude with future research directions and a more mindful approach to corpus collection and analysis.
