Conference on Empirical Methods in Natural Language Processing

A Review of Standard Text Classification Practices for Multi-label Toxicity Identification of Online Content




Language toxicity identification occupies a gray area in the ethical debate surrounding freedom of speech and censorship. Today's social media landscape is littered with unfiltered content that ranges from slightly abusive to hate-inducing. In response, we train a multi-label classifier to detect both the type and level of toxicity in online content. This content is typically colloquial and conversational in style, and because of its variability and inconsistency, classifying it requires large amounts of annotated data. We compare standard text classification methods on this task. A conventional one-vs-rest SVM classifier with character- and word-level frequency-based text representations reaches a ROC AUC score of 0.9763. We demonstrate that more advanced techniques, such as word embeddings, recurrent neural networks, attention mechanisms, classifier stacking, and semi-supervised training, can improve the ROC AUC score to 0.9862. We suggest that choosing the right model requires weighing both model accuracy and inference complexity against the needs of the application.
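To make the reported metric concrete, the following is a minimal sketch of how a ROC AUC score might be computed in a multi-label setting. The abstract does not say how the score is aggregated across labels; the unweighted (macro) average over labels used here is an assumption, as are the two label columns and the toy scores. The function names `roc_auc` and `macro_roc_auc` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC for one label: the probability that a randomly chosen
    positive example is scored above a randomly chosen negative one
    (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(Y_true: List[List[int]], Y_score: List[List[float]]) -> float:
    """Unweighted mean of per-label ROC AUC (assumed aggregation)."""
    n_labels = len(Y_true[0])
    per_label = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Toy data: 4 comments, 2 hypothetical labels (e.g. "toxic", "insult").
Y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
Y_score = [[0.9, 0.2], [0.3, 0.4], [0.4, 0.8], [0.35, 0.3]]

# Label 0 is ranked perfectly (1.0); label 1 has one misordered pair (0.75).
print(macro_roc_auc(Y_true, Y_score))  # → 0.875
```

Because each label is scored independently, this kind of evaluation pairs naturally with one-vs-rest classifiers, which produce one score per label for every comment.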


