Chinese National Conference on Computational Linguistics

Low-Resource Text Classification via Cross-lingual Language Model Fine-tuning

Abstract

Text classification tends to be difficult when manually labeled text corpora are inadequate. Low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz (the UKK languages) form words by concatenating a stem with several suffixes, and stems serve as the representation of text content. This property permits an effectively unbounded derived vocabulary, which leads to high variability in written forms and many redundant features. Low-resource agglutinative text classification therefore faces two major challenges: the lack of labeled data in the target domain and the morphological diversity of derivations in the language structure. Fine-tuning a pre-trained language model is an effective way to obtain meaningful, easy-to-use feature extractors for downstream text classification tasks. To this end, we propose AgglutiFiT, a low-resource agglutinative language model fine-tuning approach. Specifically, we build a low-noise fine-tuning dataset through morphological analysis and stem extraction, and then fine-tune a cross-lingual pre-trained model on this dataset. Moreover, we propose an attention-based fine-tuning strategy that better selects relevant semantic and syntactic information from the pre-trained language model and uses those features in downstream text classification tasks. We evaluate our methods on nine Uyghur, Kazakh, and Kyrgyz classification datasets, where they perform significantly better than several strong baselines.
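To make the attention-based fine-tuning strategy concrete, below is a minimal sketch assuming an XLM-RoBERTa encoder loaded through HuggingFace transformers. The layer-wise attention pooling shown here is one plausible reading of "selecting relevant semantic and syntactic information from the pre-trained language model"; the class name LayerAttentionClassifier, the choice of xlm-roberta-base, and all hyperparameters are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LayerAttentionClassifier(nn.Module):
    """Attend over the [CLS] representation of every encoder layer, so
    fine-tuning can weight the layers whose semantic/syntactic features
    help the downstream task most (illustrative reading of AgglutiFiT)."""

    def __init__(self, encoder_name="xlm-roberta-base", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(
            encoder_name, output_hidden_states=True
        )
        hidden = self.encoder.config.hidden_size
        self.layer_scorer = nn.Linear(hidden, 1)   # one score per layer
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states: one (batch, seq, hidden) tensor per layer
        # (embeddings + each transformer layer); take each layer's [CLS].
        cls_per_layer = torch.stack(
            [h[:, 0] for h in out.hidden_states], dim=1
        )                                          # (batch, layers, hidden)
        scores = self.layer_scorer(cls_per_layer).squeeze(-1)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        pooled = (weights * cls_per_layer).sum(dim=1)  # attention-weighted mix
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = LayerAttentionClassifier(num_labels=5)
# Inputs would be stem-segmented text produced by the morphological
# analysis / stem-extraction step described in the abstract.
batch = tokenizer(["stem segmented uyghur text"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

For brevity the sketch keeps the whole encoder trainable; in practice, low-resource fine-tuning often freezes lower layers or unfreezes them gradually to limit overfitting on small labeled sets.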
