Workshop on Computational Approaches to Linguistic Code-Switching

Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data


Abstract

Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks, since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains for languages that were not in its pre-training corpus.
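The approach described in the abstract is a two-stage pipeline: first translate English-Hindi code-mixed text into English, then classify the translations with a model fine-tuned on English-only data. A minimal sketch of that data flow is below; the `toy_translate` and `toy_classify` stand-ins are hypothetical placeholders (in the paper, translation is done by mBART and classification by fine-tuned English models):

```python
# Sketch of the translate-then-classify pipeline from the abstract.
# In the paper, `translate` would wrap mBART (a multilingual seq2seq model)
# and `classify` would be an English model fine-tuned for NLI or sentiment.

def translate_then_classify(texts, translate, classify):
    """Translate code-mixed inputs to English, then classify each translation."""
    english = [translate(t) for t in texts]
    return [classify(t) for t in english]

# Toy stand-ins purely to illustrate the data flow (not the real models):
def toy_translate(text):
    # hypothetical Hindi->English substitution standing in for mBART
    return text.replace("bahut", "very")

def toy_classify(text):
    # hypothetical sentiment classifier standing in for a fine-tuned model
    return "positive" if "very good" in text else "negative"

print(translate_then_classify(["movie bahut good tha"], toy_translate, toy_classify))
```

The design point is that the two stages are decoupled: any stronger English-only classifier can be dropped in behind the translation step without retraining on code-mixed data.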
