Published in: Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants, 2013

Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation (invited talk)


Abstract

Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, and sentiment analysis; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates, which offers opportunities for using the bi-texts of that resource-rich language. We explore various options for bi-text reuse: (ⅰ) direct combination of bi-texts, (ⅱ) combination of models trained on such bi-texts, and (ⅲ) a sophisticated combination of (ⅰ) and (ⅱ). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the sub-word level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.
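The lattice-and-language-model step mentioned in the abstract can be illustrated with a minimal sketch. It is not the talk's implementation: it assumes a word-level lattice with one slot per source word (no phrase-level options, no morphology), swaps in an add-alpha smoothed bigram model for the language model, and uses exact Viterbi search instead of either decoder named above. The function names, the adaptation scores, and the toy Malay-to-Indonesian word pairs are all assumptions made for illustration; the abstract names no language pair.

```python
import math

def train_bigram_lm(sentences, alpha=0.1):
    """Add-alpha smoothed bigram LM (an assumed stand-in for a real LM)."""
    context_counts, bigram_counts = {}, {}
    vocab = {"</s>"}
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] = context_counts.get(prev, 0) + 1
            bigram_counts[(prev, cur)] = bigram_counts.get((prev, cur), 0) + 1
    v = len(vocab)

    def logprob(prev, cur):
        num = bigram_counts.get((prev, cur), 0) + alpha
        den = context_counts.get(prev, 0) + alpha * v
        return math.log(num / den)

    return logprob

def decode_lattice(lattice, logprob):
    """Viterbi search over a slot-per-word lattice.

    lattice: list of slots; each slot is a list of
             (candidate_word, adaptation_log_score) pairs.
    Returns the word sequence with the best LM + adaptation score.
    """
    # best[word] = (score, path) for the best hypothesis ending in `word`
    best = {"<s>": (0.0, [])}
    for slot in lattice:
        new_best = {}
        for word, adapt_score in slot:
            cand = max(
                ((score + logprob(prev, word) + adapt_score, path + [word])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
            if word not in new_best or cand[0] > new_best[word][0]:
                new_best[word] = cand
        best = new_best
    final = max(
        ((score + logprob(prev, "</s>"), path)
         for prev, (score, path) in best.items()),
        key=lambda item: item[0],
    )
    return final[1]

# Toy target-side monolingual data (hypothetical, Indonesian-like).
lm = train_bigram_lm([
    "saya bisa datang karena libur",
    "dia bisa pergi karena libur",
])

# Hypothetical adaptation options for the source sentence
# "saya boleh datang kerana cuti"; the scores are made up.
lattice = [
    [("saya", 0.0)],
    [("boleh", math.log(0.6)), ("bisa", math.log(0.4))],
    [("datang", 0.0)],
    [("kerana", math.log(0.5)), ("karena", math.log(0.5))],
    [("cuti", math.log(0.5)), ("libur", math.log(0.5))],
]

adapted = decode_lattice(lattice, lm)
print(" ".join(adapted))
```

In this toy setup the language model overrides the slightly higher adaptation score for keeping "boleh", because only "bisa" appears in the target-side data. A beam-search decoder would prune the `best` table at each slot rather than keep every hypothesis.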
