Conference: Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts



Abstract

This paper examines the effect of including background knowledge in the form of a character-level pre-trained neural language model (LM), and of data bootstrapping, to overcome the problem of unbalanced limited resources. As a test case, we explore the task of language identification in mixed-language, short, non-edited texts involving an under-resourced language, namely Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that, overall, DNNs perform better on labelled data for the majority categories but struggle with the minority ones. While the effect of the untokenised and unlabelled data encoded as an LM differs for each category, bootstrapping improves the performance of all systems and all categories. These methods are language independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.

