Venue: International Workshop on Semantic Evaluation

JCT at SemEval-2020 Task 12: Offensive Language Detection in Tweets using Preprocessing Methods, Character and Word N-grams


Abstract

In this paper, we describe our submissions to the SemEval-2020 contest. We tackled Task 12, "Multilingual Offensive Language Identification in Social Media". We developed different models for four languages: Arabic, Danish, Greek, and Turkish. We applied three supervised machine learning methods using various combinations of character and word n-gram features. In addition, we applied various combinations of basic preprocessing methods. Our best submission was a Random Forest model built for offensive language identification in Danish. This model ranked 6th out of 39 submissions. Our result is lower by only 0.0025 than the result of the team that took 4th place using entirely non-neural methods. Our experiments indicate that character n-gram features are more helpful than word n-gram features. This probably occurs because tweets are short, are characterized more by characters than by words, and contain various special sequences of characters, e.g., hashtags, shortcuts, slang words, and typos.
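The contrast between character and word n-grams can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names (`char_ngrams`, `word_ngrams`, `overlap`), the whitespace tokenization, and the example strings are all assumptions made for illustration. It only shows why a typo leaves most character trigrams intact while changing the word-level token.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text: str, n: int) -> Counter:
    """Count overlapping word n-grams after simple whitespace tokenization."""
    tokens = text.split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(a: Counter, b: Counter) -> int:
    """Number of n-gram occurrences shared by two count vectors."""
    return sum((a & b).values())


# Hypothetical example: the same short text with and without a typo.
clean = "so stupid"
noisy = "so stupidd"

# Word unigrams: only "so" matches, because "stupid" != "stupidd".
shared_words = overlap(word_ngrams(clean, 1), word_ngrams(noisy, 1))

# Character trigrams: every trigram of the clean text also occurs in the noisy one.
shared_chars = overlap(char_ngrams(clean, 3), char_ngrams(noisy, 3))

print(shared_words, shared_chars)
```

In this toy case the misspelled word is lost entirely at the word level, while all character trigrams of the clean text are still present in the noisy one. This is consistent with the explanation the abstract offers for the stronger character n-gram results.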
