
Subword-Level Language Identification for Intra-Word Code-Switching



Abstract

Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish-Wixarika dataset and on an adapted German-Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.
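
To make the task concrete, here is a minimal sketch of joint splitting and tagging with a segment-level dynamic program, the kind of search that segmental models typically rely on. It is not the authors' implementation: a segmental recurrent neural network would score segments with learned representations, whereas this sketch assumes a hand-written toy lexicon. The names `segment_and_tag`, `toy_score`, and `TOY_LEXICON`, the tags `DE` and `TR`, and the example word are all illustrative assumptions, not items taken from the paper's datasets.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[str, str]  # (substring, language tag)


def segment_and_tag(
    word: str,
    score: Callable[[str, str], float],
    tags: Sequence[str],
    max_len: int = 10,
) -> List[Segment]:
    """Return the split of `word` into tagged segments with the highest total score."""
    n = len(word)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best score for word[:i]
    back: List[Optional[Tuple[int, str]]] = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] == neg_inf:
                continue
            piece = word[start:end]
            for tag in tags:
                candidate = best[start] + score(piece, tag)
                if candidate > best[end]:
                    best[end] = candidate
                    back[end] = (start, tag)
    # Follow the back-pointers from the end of the word to recover the segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start, tag = back[end]
        segments.append((word[start:end], tag))
        end = start
    return segments[::-1]


# Hypothetical toy scorer (an assumption for illustration only): known
# morphemes get a positive score, anything else a length-based penalty.
TOY_LEXICON = {("termin", "DE"): 5.0, ("ler", "TR"): 3.0}


def toy_score(piece: str, tag: str) -> float:
    return TOY_LEXICON.get((piece, tag), -(len(piece) + 1.0))


print(segment_and_tag("terminler", toy_score, ["DE", "TR"]))
# → [('termin', 'DE'), ('ler', 'TR')]
```

A word-level tagger would have to assign a single language ID to the whole word, whereas the subword-level formulation returns one tag per segment. Replacing `toy_score` with a learned scoring function leaves the search unchanged.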

