
Novel textual features for language modeling of intra-sentential code-switching data



Abstract

Code-switching refers to the frequent use of non-native language words or phrases by speakers while conversing in their native language. Traditionally, training a language model (LM) for code-switching data requires tediously collecting a large text corpus in the respective code-switching domain. Alternatively, we recently proposed a more viable approach that adapts an existing native LM to handle code-switching data. In this work, we present our efforts toward language modeling of code-switching data following both the traditional and the proposed approaches. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus, (ii) an improved part-of-speech (POS) labeling scheme for accurate tagging of non-native words embedded in code-switching data, and (iii) the proposal of a novel textual feature, referred to as the code-switching location (CSL) feature, that allows LMs to predict code-switching instances. The proposed features are evaluated on two code-switching datasets: Hindi-English and Mandarin-English. In the experiments, the improved POS features yield a substantial reduction in perplexity. The proposed CSL features are also observed to provide an independent and additive improvement in perplexity over the POS features.
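The abstract does not spell out how the CSL feature is encoded, so the following is only a minimal sketch under stated assumptions: each token carries a language ID, and a hypothetical binary tag marks tokens after which the language changes. The function names `csl_tags` and `perplexity`, the tag encoding, and the example sentence are illustrative and not taken from the paper. The perplexity function uses the standard definition, the exponential of the average negative log-probability per token.

```python
import math


def csl_tags(lang_ids):
    """Assumed encoding: tag a token 1 if the NEXT token is in a
    different language (a code-switch follows), otherwise 0."""
    tags = []
    for i, lang in enumerate(lang_ids):
        # The last token has no successor, so it is never a switch point.
        nxt = lang_ids[i + 1] if i + 1 < len(lang_ids) else lang
        tags.append(1 if nxt != lang else 0)
    return tags


def perplexity(token_probs):
    """Standard perplexity: exp of the mean negative log-probability
    that a model assigned to each token of the evaluated text."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


# Made-up Hindi-English sentence, for illustration only.
tokens = ["mujhe", "yeh", "movie", "bahut", "pasand", "hai"]
langs = ["hi", "hi", "en", "hi", "hi", "hi"]

# Pair each word with its hypothetical CSL tag, as a factored LM input might.
tagged = list(zip(tokens, csl_tags(langs)))
print(tagged)

# A model that gives every token probability 0.25 has perplexity close to 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

In a factored LM, tags like these would sit alongside the word and POS factors, so a lower perplexity with the extra factor would indicate that switch points have become more predictable.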

Bibliographic record

  • Source
    Computer Speech and Language | 2020, Issue 11 | 101099.1-101099.19 | 19 pages
  • Author affiliations

    Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India;

    Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India;

    Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India;

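  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords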

    Code-switching; Textual features; Factored language modeling;

