Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

Oluwafemi Oriola; Eduan Kotzé

Abstract

Semi-supervised learning is a potential solution for improving training data in low-resource abusive language detection contexts, such as the detection of South African abusive language on Twitter. However, existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper therefore presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Network classifiers is evaluated. The proposed technique, with an accuracy of 0.97 and an F1-score of 0.95, outperforms existing semi-supervised learning techniques. The learning curves show that the proposed technique used the training data more efficiently than the existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
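The labelling idea in the abstract, majority voting over several feature sets, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering step, so a nearest-class-centroid rule stands in for it here, and every name (`label_by_nearest_centroid`, `majority_vote_labels`, `min_agreement`) and all toy vectors are hypothetical.

```python
from collections import Counter


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_by_nearest_centroid(labelled, labels, unlabelled):
    """Assumed stand-in for the clustering step: give each unlabelled
    vector the label of the closest class centroid in this feature set."""
    by_class = {}
    for vec, lab in zip(labelled, labels):
        by_class.setdefault(lab, []).append(vec)
    centroids = {lab: centroid(vecs) for lab, vecs in by_class.items()}
    return [
        min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))
        for vec in unlabelled
    ]


def majority_vote_labels(views, labels, min_agreement=2):
    """views holds one (labelled_vectors, unlabelled_vectors) pair per
    feature set. Returns one label per unlabelled item, or None when the
    most common label has fewer than min_agreement votes."""
    per_view = [
        label_by_nearest_centroid(lab_x, labels, unlab_x)
        for lab_x, unlab_x in views
    ]
    result = []
    for votes in zip(*per_view):
        label, count = Counter(votes).most_common(1)[0]
        result.append(label if count >= min_agreement else None)
    return result


# Toy data: four labelled and two unlabelled tweets in three feature sets.
labels = ["abusive", "clean", "abusive", "clean"]
views = [
    ([[1, 0], [0, 1], [0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.2, 0.8]]),
    ([[2, 2], [0, 0], [2, 1], [0, 1]], [[1.8, 1.5], [1.9, 1.4]]),
    ([[5], [1], [4], [2]], [[4.5], [1.2]]),
]
pseudo_labels = majority_vote_labels(views, labels)
strict_labels = majority_vote_labels(views, labels, min_agreement=3)
```

In this toy run the second unlabelled item receives two "clean" votes and one "abusive" vote, so it is labelled only when two agreeing feature sets suffice. Raising `min_agreement` trades coverage for cleaner pseudo-labels before they are added to the training data.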

