Chinese National Conference on Computational Linguistics

Low-Resource Text Classification via Cross-lingual Language Model Fine-tuning

Abstract

Text classification tends to be difficult when manually labeled text corpora are inadequate. Low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz (the UKK languages) form words by concatenating a stem with several suffixes, and stems serve as the representation of text content. This property permits an effectively unbounded derived vocabulary, which leads to high variability in written forms and many redundant features. Low-resource agglutinative text classification therefore faces two major challenges: the lack of labeled data in the target domain and the morphological diversity of derivations in the language structure. Fine-tuning a pre-trained language model is an effective way to obtain meaningful, easy-to-use feature extractors for downstream text classification tasks. To this end, we propose AgglutiFiT, a low-resource agglutinative language model fine-tuning approach. Specifically, we build a low-noise fine-tuning dataset through morphological analysis and stem extraction, and then fine-tune a cross-lingual pre-trained model on this dataset. Moreover, we propose an attention-based fine-tuning strategy that better selects relevant semantic and syntactic information from the pre-trained language model and uses those features in downstream text classification tasks. We evaluate our methods on nine Uyghur, Kazakh, and Kyrgyz classification datasets, where they perform significantly better than several strong baselines.
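To make the attention-based fine-tuning strategy concrete, below is a minimal sketch assuming an XLM-RoBERTa encoder loaded through HuggingFace transformers. The layer-wise attention pooling shown here is one plausible reading of "selecting relevant semantic and syntactic information from the pre-trained language model"; the class name LayerAttentionClassifier, the choice of xlm-roberta-base, and all hyperparameters are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LayerAttentionClassifier(nn.Module):
    """Attend over the [CLS] representation of every encoder layer, so
    fine-tuning can weight the layers whose semantic/syntactic features
    help the downstream task most (illustrative reading of AgglutiFiT)."""

    def __init__(self, encoder_name="xlm-roberta-base", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(
            encoder_name, output_hidden_states=True
        )
        hidden = self.encoder.config.hidden_size
        self.layer_scorer = nn.Linear(hidden, 1)   # one score per layer
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states: one (batch, seq, hidden) tensor per layer
        # (embeddings + each transformer layer); take each layer's [CLS].
        cls_per_layer = torch.stack(
            [h[:, 0] for h in out.hidden_states], dim=1
        )                                          # (batch, layers, hidden)
        scores = self.layer_scorer(cls_per_layer).squeeze(-1)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        pooled = (weights * cls_per_layer).sum(dim=1)  # attention-weighted mix
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = LayerAttentionClassifier(num_labels=5)
# Inputs would be stem-segmented text produced by the morphological
# analysis / stem-extraction step described in the abstract.
batch = tokenizer(["stem segmented uyghur text"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

For brevity the sketch keeps the whole encoder trainable; in practice, low-resource fine-tuning often freezes lower layers or unfreezes them gradually to limit overfitting on small labeled sets.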
