Lower Bias, Higher Density Abusive Language Datasets: A Recipe

Abstract

Datasets to train models for abusive language detection are both necessary and scarce. One reason for their limited availability is the cost of their creation: manual annotation is expensive, and on top of that the phenomenon itself is sparse, forcing human annotators to sift through a large number of irrelevant examples in order to obtain a meaningful amount of data. Strategies used so far to increase the density of abusive language and obtain more meaningful data include filtering the data on the basis of pre-selected keywords and drawing on hate-rich sources. We suggest a recipe that can at the same time provide meaningful data with a possibly higher density of abusive language and reduce the top-down biases that corpus creators impose when selecting the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are then used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered datasets, and a more meaningful topic distribution after filtering.
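Concretely, the recipe amounts to two steps: harvest terms that surface in Reddit's controversial listings, then use them as a keyword filter over a Twitter sample. Below is a minimal sketch of that pipeline in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes PRAW credentials for the Reddit side, tweets stored one per line in a local file, and naive frequency-based keyword extraction as a stand-in for whatever extraction step the paper actually uses.

# Hypothetical sketch of the Reddit-to-Twitter keyword-filtering recipe.
# Assumptions (not from the paper): PRAW for Reddit access, frequency-based
# keyword extraction, and tweets stored one per line in tweets.txt.
import re
from collections import Counter

import praw  # pip install praw

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "that",
             "for", "on", "it", "this", "with", "was", "are", "you", "not"}

def controversial_keywords(reddit, top_k=50):
    """Harvest candidate keywords from titles of controversial Reddit posts."""
    counts = Counter()
    # Reddit's "controversial" listing is the controversy signal the paper exploits.
    for post in reddit.subreddit("all").controversial(time_filter="week", limit=500):
        for token in re.findall(r"[a-z']+", post.title.lower()):
            if token not in STOPWORDS and len(token) > 2:
                counts[token] += 1
    return {word for word, _ in counts.most_common(top_k)}

def filter_tweets(tweets, keywords):
    """Keep only tweets containing at least one harvested keyword."""
    return [t for t in tweets
            if keywords & set(re.findall(r"[a-z']+", t.lower()))]

if __name__ == "__main__":
    reddit = praw.Reddit(client_id="...", client_secret="...",
                         user_agent="keyword-recipe-sketch")
    keywords = controversial_keywords(reddit)
    with open("tweets.txt", encoding="utf-8") as f:
        tweets = [line.strip() for line in f if line.strip()]
    dense = filter_tweets(tweets, keywords)
    print(f"kept {len(dense)} of {len(tweets)} tweets")

The key design choice, per the abstract, is that the keyword list is induced bottom-up from Reddit's controversy signal rather than hand-picked by the corpus creators, which is what is meant to reduce top-down selection bias.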