Conference: Workshop on Language Technology for Equality, Diversity and Inclusion

CFILT IIT Bombay@LT-EDI-EACL2021: Hope Speech Detection for Equality, Diversity, and Inclusion using Multilingual Representation from Transformers

Abstract

With the internet now an integral part of daily life, engagement on social media has increased substantially. Identifying and removing offensive content from social media has become a top priority for preventing violence. However, detecting encouraging, supportive, and positive content is equally important to prevent the misuse of censorship to attack freedom of speech. This paper presents our system for the shared task Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI, EACL 2021. The data for this shared task, collected from YouTube comments, is provided in English, Tamil, and Malayalam. The task is a multi-class classification problem in which each data instance is assigned to one of three classes: 'Hope speech', 'Not hope speech', and 'Not in intended language'. We propose a system that employs multilingual transformer models to obtain a representation of the text and classify it into one of the three classes. We explored multilingual models trained specifically for Indian languages alongside generic multilingual models. On the final leaderboard published by the organizers, our system ranked 2nd for English, 2nd for Malayalam, and 7th for Tamil, obtaining weighted F1-scores of 0.92, 0.84, and 0.55, respectively, on the hidden test dataset used for the competition. We have made our system publicly available on GitHub.