Conference: Second Workshop on Abusive Language Online 2018

Boosting Text Classification Performance on Sexist Tweets by Text Augmentation and Text Generation Using a Combination of Knowledge Graphs


Abstract

Text classification models are widely used across natural language processing problems. Like any other machine learning model, these classifiers depend heavily on the size and quality of the training dataset: insufficient and unbalanced datasets lead to poor performance. One solution to poor datasets is to use world knowledge, in the form of knowledge graphs, to improve the training data. In this paper, we use ConceptNet and Wikidata to improve sexist tweet classification with two methods: (1) text augmentation and (2) text generation. In our text generation approach, we generate new tweets by replacing words using data acquired from ConceptNet relations in order to increase the size of our training set. This method is very helpful with frustratingly small datasets; it preserves the label and increases diversity. In our text augmentation approach, the number of tweets in each class remains the same, but each tweet's words are augmented (by concatenation) with words extracted from their ConceptNet relations and with their descriptions extracted from Wikidata, so the range of each tweet increases. Our experiments show that our approach significantly improves sexist tweet classification across all of our machine learning models. Our approach can be readily applied to any other small dataset, such as hate speech or abusive language, and to any text classification problem using any machine learning model.
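The two methods described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup tables `RELATED_WORDS` and `DESCRIPTIONS` are hypothetical, hand-written stand-ins for ConceptNet relations and Wikidata descriptions, the function names are invented for illustration, and the example sentence is a neutral placeholder rather than data from the paper.

```python
from typing import Dict, List

# Hypothetical toy tables. A real pipeline would look these up in ConceptNet
# (related words) and Wikidata (descriptions) instead of hard-coding them.
RELATED_WORDS: Dict[str, List[str]] = {
    "doctor": ["physician", "medic"],
    "hospital": ["clinic"],
}
DESCRIPTIONS: Dict[str, str] = {
    "hospital": "health care facility",
}


def generate_tweets(tweet: str, related: Dict[str, List[str]]) -> List[str]:
    """Text generation: build one new tweet per single-word substitution.

    Each generated tweet is meant to inherit the label of the source tweet,
    so the training set grows while labels are preserved.
    """
    tokens = tweet.split()
    generated = []
    for i, token in enumerate(tokens):
        for substitute in related.get(token.lower(), []):
            new_tokens = tokens[:i] + [substitute] + tokens[i + 1:]
            generated.append(" ".join(new_tokens))
    return generated


def augment_tweet(
    tweet: str,
    related: Dict[str, List[str]],
    descriptions: Dict[str, str],
) -> str:
    """Text augmentation: keep the tweet and concatenate extra words to it.

    The number of tweets does not change; each tweet only gets longer.
    """
    extra: List[str] = []
    for token in tweet.split():
        key = token.lower()
        extra.extend(related.get(key, []))
        if key in descriptions:
            extra.append(descriptions[key])
    return " ".join([tweet] + extra)


if __name__ == "__main__":
    example = "the doctor left the hospital"
    for new_tweet in generate_tweets(example, RELATED_WORDS):
        print(new_tweet)
    print(augment_tweet(example, RELATED_WORDS, DESCRIPTIONS))
```

In this sketch, generation turns one labeled tweet into several, while augmentation returns exactly one, longer tweet per input. This mirrors the distinction the abstract draws between increasing the size of the training set and increasing the range of each tweet.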
