Conference: International Conference on Computing, Communication and Intelligent Systems

A Parsimonious and Practical Approach to Detecting Offensive Speech



Abstract

With the proliferation of hateful and offensive speech on social media platforms such as Twitter, machine learning approaches to detecting such toxic content have gained prominence. Despite these advances, real-time detection of such speech, while it is being shared on these platforms, remains a challenge for two reasons. First, these approaches train complex models on a plethora of features, which calls into question their computational efficiency for real-time deployment. Second, they require sizeable, manually annotated data sets from the same context, and annotating large data sets is extremely time-consuming, error-prone and cumbersome. This paper proposes a parsimonious and practical approach to detecting offensive speech that alleviates these challenges. The approach is parsimonious because, through a comprehensive evaluation of commonly used machine learning models (Logistic Regression, Random Forest, Neural Networks) on two public domain data sets, it demonstrates that a simple Logistic Regression model trained on unigrams with frequency counts can detect hate speech with an accuracy of over 90%. It is practical because it demonstrates how an existing labeled training data set can be used to train models that detect offensive content in a completely unknown data set with moderate accuracy. Based on these findings, the paper offers guidance on the characteristics that may be desirable in benchmark training data sets for offensive speech detection.
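The core idea in the abstract, a Logistic Regression model trained on unigram frequency counts and then applied to texts it has never seen, can be illustrated with a minimal from-scratch sketch. Everything below is an assumption made for illustration: the toy sentences and labels are invented, the whitespace tokenizer, learning rate, and epoch count are arbitrary choices, and the function names (`unigram_counts`, `train_logistic_regression`, `predict_proba`, `accuracy`) are hypothetical. It is not the paper's implementation, data, or results.

```python
import math
from collections import Counter

def unigram_counts(text):
    """Lowercase, split on whitespace, and count tokens (unigram frequency counts)."""
    return Counter(text.lower().split())

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, counts):
    """Linear score: bias plus weight * count for every unigram in the text."""
    return bias + sum(weights.get(tok, 0.0) * c for tok, c in counts.items())

def train_logistic_regression(texts, labels, epochs=200, lr=0.1):
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    features = [unigram_counts(t) for t in texts]
    for _ in range(epochs):
        for counts, y in zip(features, labels):
            err = sigmoid(score(weights, bias, counts)) - y  # gradient of log loss
            bias -= lr * err
            for tok, c in counts.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * c
    return weights, bias

def predict_proba(model, text):
    """Probability that `text` is offensive (label 1) under the fitted model."""
    weights, bias = model
    return sigmoid(score(weights, bias, unigram_counts(text)))

def accuracy(model, texts, labels):
    """Fraction of texts whose thresholded prediction matches the label."""
    hits = sum((predict_proba(model, t) > 0.5) == bool(y)
               for t, y in zip(texts, labels))
    return hits / len(texts)

# Invented toy training set (1 = offensive, 0 = not offensive); illustration only.
TRAIN_TEXTS = [
    "you are a stupid idiot", "shut up you idiot",
    "what a stupid loser", "nobody likes you loser",
    "have a great day", "thanks for the help",
    "what a lovely photo", "see you at the game",
]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

# Invented texts the model never saw, standing in for an "unknown" data set.
UNSEEN_TEXTS = ["you stupid loser", "shut up idiot",
                "thanks for the lovely day", "have a great game"]
UNSEEN_LABELS = [1, 1, 0, 0]

model = train_logistic_regression(TRAIN_TEXTS, TRAIN_LABELS)
print("accuracy on unseen texts:", accuracy(model, UNSEEN_TEXTS, UNSEEN_LABELS))
```

Because each prediction is a single sparse dot product over the tokens in one message, a model of this shape is cheap to evaluate, which is the kind of computational efficiency the abstract associates with real-time deployment. In practice a library implementation with regularization and a proper tokenizer would likely replace the hand-written loop.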
