Source: Second workshop on subword and character level models in NLP, 2018

A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts



Abstract

This paper examines the effect of including background knowledge in the form of a character-level pre-trained neural language model (LM), and of data bootstrapping, to overcome the problem of limited and unbalanced resources. As a test case, we explore language identification in mixed-language, short, non-edited texts involving an under-resourced language, namely Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that, overall, DNNs perform better on labelled data for the majority categories but struggle with the minority ones. While the effect of the untokenised, unlabelled data encoded as an LM differs across categories, bootstrapping improves the performance of all systems on all categories. These methods are language independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.
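The bootstrapping the abstract refers to is a form of self-training: train a classifier on the small labelled set, pseudo-label the unlabelled texts the classifier is confident about, and retrain on the augmented data. The sketch below illustrates one such round; the character n-gram TF-IDF features, logistic regression classifier, and confidence threshold are illustrative assumptions, not the paper's exact setup.

```python
# Minimal self-training (bootstrapping) sketch for language identification.
# Assumed components: char n-gram TF-IDF + logistic regression; the paper's
# own classifiers and threshold may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def bootstrap(labelled_texts, labels, unlabelled_texts, threshold=0.9):
    """One self-training round over character n-gram features."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    X = vec.fit_transform(labelled_texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Pseudo-label the unlabelled texts the model is confident about.
    U = vec.transform(unlabelled_texts)
    proba = clf.predict_proba(U)
    confident = proba.max(axis=1) >= threshold
    new_texts = [t for t, keep in zip(unlabelled_texts, confident) if keep]
    new_labels = clf.classes_[proba.argmax(axis=1)][confident]

    # Retrain on the original labelled data plus the pseudo-labelled data.
    X_aug = vec.transform(list(labelled_texts) + new_texts)
    clf = LogisticRegression(max_iter=1000).fit(
        X_aug, list(labels) + list(new_labels))
    return vec, clf
```

In practice the round can be repeated, moving newly confident examples into the training set each time; this is how bootstrapping can help all categories, including the minority ones, when unlabelled data is more plentiful than labelled data.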

Bibliographic details

  • Venue: New Orleans (US)
  • Author affiliations

    Department of Philosophy, Linguistics and Theory of Science (FLoV), Centre for Linguistic Theory and Studies in Probability (CLASP), University of Gothenburg;

    Department of Philosophy, Linguistics and Theory of Science (FLoV), Centre for Linguistic Theory and Studies in Probability (CLASP), University of Gothenburg;

    Department of Philosophy, Linguistics and Theory of Science (FLoV), Centre for Linguistic Theory and Studies in Probability (CLASP), University of Gothenburg;

    CEA, LIST, Vision and Content Engineering Laboratory Gif-sur-Yvette, France;

  • Format: PDF
  • Language: English
