Journal of Language Modelling

Learning cross-lingual phonological and orthographic adaptations: a case study in improving neural machine translation between low-resource languages



Abstract

Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, particularly for low-resource language (LRL) pairs, i.e., language pairs for which few or no parallel corpora exist. Our work adapts variants of seq2seq models to transduce such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi-Bhojpuri words. We demonstrate that our models can be used effectively for language pairs that have limited parallel corpora; the models work at the character level to capture phonetic and orthographic similarities across multiple types of word adaptations, whether synchronic or diachronic, loan words or cognates. We describe the training aspects of several character-level NMT systems that we adapted to this task and characterize their typical errors. Our method improves the BLEU score by 6.3 on the Hindi-to-Bhojpuri translation task. Further, we show that such transductions generalize well to other languages by applying the approach successfully to Hindi-Bangla cognate pairs. Our work can be seen as an important step towards: (i) resolving the OOV word problem arising in MT tasks; (ii) creating effective parallel corpora for resource-constrained languages; and (iii) leveraging the enhanced semantic knowledge captured by word-level embeddings to perform character-level tasks.
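As a rough illustration of what "working at the character level" on cognate pairs can involve, the sketch below turns word pairs into padded sequences of character ids of the kind a seq2seq model might consume. It is not the authors' pipeline: the helper names (`build_char_vocab`, `encode`, `pad_batch`), the special tokens, and the word pairs are all assumptions made for illustration, and the pairs are invented Latin-script placeholders rather than real Hindi or Bhojpuri data. It also treats each Unicode code point as one character, which is a simplification for scripts such as Devanagari where vowel signs are separate code points.

```python
from typing import Dict, List

# Hypothetical special tokens; the source does not specify any.
PAD, SOS, EOS = "<pad>", "<s>", "</s>"


def build_char_vocab(words: List[str]) -> Dict[str, int]:
    """Assign an integer id to every character (code point) seen in `words`."""
    vocab = {PAD: 0, SOS: 1, EOS: 2}
    for word in words:
        for ch in word:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def encode(word: str, vocab: Dict[str, int]) -> List[int]:
    """Wrap a word's character ids in start and end markers."""
    return [vocab[SOS]] + [vocab[ch] for ch in word] + [vocab[EOS]]


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


# Invented placeholder pairs (source word, target word); NOT real cognates.
pairs = [("kaba", "kabo"), ("tiru", "tir")]

# Separate vocabularies for the source and target sides.
src_vocab = build_char_vocab([s for s, _ in pairs])
tgt_vocab = build_char_vocab([t for _, t in pairs])

src_batch = pad_batch([encode(s, src_vocab) for s, _ in pairs])
tgt_batch = pad_batch([encode(t, tgt_vocab) for _, t in pairs])
```

The resulting `src_batch` and `tgt_batch` are equal-length integer sequences that could be fed to any character-level encoder-decoder. For real Devanagari or Bangla input, segmenting by grapheme cluster instead of code point may be a better choice.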
