International Workshop on Semantic Evaluation

Ferryman at SemEval-2020 Task 12: BERT-Based Model with Advanced Improvement Methods for Multilingual Offensive Language Identification



Abstract

Indiscriminately posting offensive remarks on social media may promote the occurrence of negative events such as violence, crime, and hatred. This paper examines different approaches and models for offensive tweet classification, which is part of the OffensEval 2020 competition (Zampieri et al., 2020; Zampieri et al., 2019b). The dataset is the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a), which contains 14,200 annotated English tweets (Rosenthal et al., 2020). The main challenges in data preprocessing are the unbalanced class distribution, abbreviations, and emoji. To overcome these issues, methods such as hashtag segmentation, abbreviation replacement, and emoji replacement were adopted during preprocessing. The main task is divided into three sub-tasks, which are solved with a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, Bidirectional Encoder Representations from Transformers (BERT), and multi-dropout, respectively. Meanwhile, we applied different learning rates for different languages and tasks on both BERT and non-BERT models in order to obtain better results. Our team Ferryman ranked 18th, 8th, and 21st with an F1-score of 0.91152 on the English Sub-task A, Sub-task B, and Sub-task C, respectively. Furthermore, our team also ranked in the top 20 on Sub-task A for the other languages (Çöltekin, 2020; Sigurbergsson and Derczynski, 2020; Mubarak et al., 2020; Pitenis et al., 2020).
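The three preprocessing steps named above (hashtag segmentation, abbreviation replacement, emoji replacement) can be sketched as a small pipeline. This is a minimal stdlib-only illustration, not the authors' implementation: the lookup tables `ABBREVIATIONS` and `EMOJI_MAP` are tiny hypothetical stand-ins for the much larger mappings such a system would use, and the hashtag splitter assumes CamelCase hashtags.

```python
import re

# Hypothetical lookup tables for illustration; a real system would use
# far larger abbreviation and emoji dictionaries.
ABBREVIATIONS = {"u": "you", "idk": "i do not know", "lol": "laughing out loud"}
EMOJI_MAP = {"\U0001F600": "grinning face", "\U0001F620": "angry face"}

def segment_hashtag(tag: str) -> str:
    """Split a CamelCase hashtag body such as 'StopHate' into words."""
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tag)
    return " ".join(w.lower() for w in words)

def preprocess(tweet: str) -> str:
    # 1. Hashtag segmentation: "#StopHate" -> "stop hate"
    tweet = re.sub(r"#(\w+)", lambda m: segment_hashtag(m.group(1)), tweet)
    # 2. Abbreviation replacement on whitespace-separated tokens
    tweet = " ".join(ABBREVIATIONS.get(t.lower(), t) for t in tweet.split())
    # 3. Emoji replacement: map known emoji to textual descriptions
    for emo, desc in EMOJI_MAP.items():
        tweet = tweet.replace(emo, f" {desc} ")
    # Collapse any whitespace introduced by the replacements
    return re.sub(r"\s+", " ", tweet).strip()

print(preprocess("idk why u post this #StopHate \U0001F620"))
# -> "i do not know why you post this stop hate angry face"
```

The normalized text can then be fed either to a TF-IDF vectorizer or to BERT's tokenizer; converting emoji and hashtags to plain words keeps them inside the model's vocabulary instead of being dropped as unknown tokens.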