Workshop on Computational Approaches to Linguistic Code-Switching

Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data


Abstract

Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains for languages that were not in its pre-training corpus.
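The approach described in the abstract is a two-stage pipeline: first translate English-Hindi code-mixed text into English, then classify the translations with a model fine-tuned on English-only data. A minimal sketch of that data flow is below; the `toy_translate` and `toy_classify` stand-ins are hypothetical placeholders (in the paper, translation is done by mBART and classification by fine-tuned English models):

```python
# Sketch of the translate-then-classify pipeline from the abstract.
# In the paper, `translate` would wrap mBART (a multilingual seq2seq model)
# and `classify` would be an English model fine-tuned for NLI or sentiment.

def translate_then_classify(texts, translate, classify):
    """Translate code-mixed inputs to English, then classify each translation."""
    english = [translate(t) for t in texts]
    return [classify(t) for t in english]

# Toy stand-ins purely to illustrate the data flow (not the real models):
def toy_translate(text):
    # hypothetical Hindi->English substitution standing in for mBART
    return text.replace("bahut", "very")

def toy_classify(text):
    # hypothetical sentiment classifier standing in for a fine-tuned model
    return "positive" if "very good" in text else "negative"

print(translate_then_classify(["movie bahut good tha"], toy_translate, toy_classify))
```

The design point is that the two stages are decoupled: any stronger English-only classifier can be dropped in behind the translation step without retraining on code-mixed data.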
