Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

Abstract

For endangered languages, data collection campaigns must accommodate the fact that many of these languages come from an oral tradition and that producing transcriptions is costly. It is therefore essential to translate the recordings into a widely spoken language to ensure that they remain interpretable. In this paper, we investigate how the choice of translation language affects the subsequent documentation work and the automatic approaches that may operate on the resulting bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs, which we apply to the task of low-resource unsupervised word segmentation and alignment. Our results show that the choice of translation language influences word segmentation performance, and that different lexicons are learned when different aligned translations are used. Lastly, this paper proposes a hybrid approach for bilingual word segmentation that combines boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model of Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation increases their translation and alignment quality, especially for challenging language pairs.
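As a rough illustration of two ideas in the abstract, the sketch below enumerates directed language pairs and marks boundary clues in an unsegmented input sequence. It is not the authors' implementation. The language list is an assumption based on the eight languages MaSS is generally described as covering, which would account for the 56 pairs as 8 × 7 directed combinations. The function names, the `<b>` marker token, and the idea of inserting a marker after each hypothesised boundary are hypothetical choices made for illustration; the abstract only states that the clues are incorporated into the input representation.

```python
from itertools import permutations

# Assumption: the eight languages of the MaSS corpus (not listed in the abstract).
MASS_LANGUAGES = [
    "Basque", "English", "Finnish", "French",
    "Hungarian", "Romanian", "Russian", "Spanish",
]


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def inject_boundary_clues(units, boundaries, marker="<b>"):
    """Insert a marker after each unit whose index is in `boundaries`.

    `units` is an unsegmented sequence (e.g. phonemes or characters) and
    `boundaries` holds indices after which an external segmenter, such as a
    Bayesian model, hypothesised a word boundary. The marker token is a
    hypothetical way of exposing those clues to a downstream model.
    """
    augmented = []
    last_index = len(units) - 1
    for index, unit in enumerate(units):
        augmented.append(unit)
        # No marker after the final unit: the sequence end is already a boundary.
        if index in boundaries and index < last_index:
            augmented.append(marker)
    return augmented


if __name__ == "__main__":
    pairs = directed_pairs(MASS_LANGUAGES)
    print(len(pairs))  # 8 * 7 directed pairs
    print(inject_boundary_clues(list("thecat"), {2}))
```

Treating the pairs as ordered matters here because each language can act either as the language to be segmented or as the aligned translation, so the two directions of a pair are distinct experimental conditions.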
