Lower Bias, Higher Density Abusive Language Datasets: A Recipe

Abstract

Datasets to train models for abusive language detection are both necessary and scarce. One reason for their limited availability is the cost of their creation. Manual annotation is expensive and, on top of that, the phenomenon itself is sparse, so human annotators have to go through a large number of irrelevant examples in order to obtain some significant data. Strategies used until now to increase the density of abusive language and obtain more meaningful data include filtering data on the basis of pre-selected keywords and drawing on hate-rich sources of data. We suggest a recipe that can provide meaningful data with possibly higher density of abusive language while also reducing the top-down biases imposed by corpus creators in the selection of the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered datasets, and a more meaningful topic distribution after filtering.
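
The filtering step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how keywords are extracted from Reddit, so plain frequency ranking is used here as a stand-in, and the stopword list, the function names, the toy posts, and the toy labels are all hypothetical. The sketch only shows the general shape of the recipe: derive keywords from one source, filter a second dataset with them, and compare the share of abusive items before and after.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "are",
             "on", "for", "this", "that", "it", "at", "my", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(posts, k):
    """Return the k most frequent non-stopword tokens across all posts.

    Frequency ranking is a stand-in; the source does not specify
    how keywords are actually selected.
    """
    counts = Counter(
        tok for post in posts for tok in tokenize(post) if tok not in STOPWORDS
    )
    return {word for word, _ in counts.most_common(k)}


def filter_by_keywords(tweets, keywords):
    """Keep (text, is_abusive) pairs whose text contains at least one keyword."""
    return [(text, label) for text, label in tweets
            if keywords & set(tokenize(text))]


def abusive_density(tweets):
    """Fraction of tweets labeled abusive; 0.0 for an empty list."""
    if not tweets:
        return 0.0
    return sum(1 for _, label in tweets if label) / len(tweets)


# Toy stand-ins for posts from a controversial Reddit feed (invented data).
reddit_posts = [
    "The immigration debate is heated",
    "Immigration policy and the election",
    "Election results and immigration",
]

# Toy stand-ins for annotated tweets: (text, is_abusive). Labels are invented.
tweets = [
    ("Lovely weather today", False),
    ("Immigration rant aimed at a group", True),
    ("My cat is asleep", False),
    ("Election news roundup", False),
    ("Coffee with friends", False),
    ("Another election insult thread", True),
]

keywords = top_keywords(reddit_posts, k=2)
filtered = filter_by_keywords(tweets, keywords)

print("keywords:", sorted(keywords))
print("density, unfiltered:", abusive_density(tweets))
print("density, filtered:", abusive_density(filtered))
```

On real data, the density comparison would rest on manual annotation of samples from both the filtered and the unfiltered sets. The toy labels above only make the arithmetic visible.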

Bibliographic details

Source
Conference on Resources and Techniques for User and Author Profiling in Abusive Language | 2020 | pp. 14-19 | 6 pages
Authors
Juliet van Rosendaal; Tommaso Caselli; Malvina Nissim
Original format
PDF

