Conference proceedings: Computational Linguistics and Intelligent Text Processing

Statistical Machine Translation into a Morphologically Complex Language

Abstract

In this paper, we present the results of our investigation into phrase-based statistical machine translation from English into Turkish, an agglutinative language with very productive inflectional and derivational word-formation processes. We investigate different representational granularities for morphological structure and find that (ⅰ) representing both Turkish and English at the morpheme level but with some selective morpheme grouping on the Turkish side of the training data, (ⅱ) augmenting the training data with "sentences" comprising only the content words of the original training data to bias root-word alignment, and with highly reliable phrase pairs from an earlier corpus alignment, (ⅲ) re-ranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (ⅳ) "repairing" translated words with incorrect morphological structure and words which are out-of-vocabulary relative to the training and the language model corpus, provide a non-trivial improvement over a word-based baseline despite our very limited training data. We improve from 19.77 BLEU points for our word-based baseline model to 26.87 BLEU points, an improvement of 7.10 points or about 36% relative. We briefly discuss the applicability of BLEU to morphologically complex languages like Turkish and present a simple extension that compares tokens not in an all-or-none fashion but by taking lexical-semantic and morpho-semantic similarities into account, implemented in our BLEU+ tool.
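Step (ⅲ) of the abstract, re-ranking n-best morpheme-sequence outputs with a word-based language model, can be illustrated with a minimal sketch. It is not the authors' implementation: the "+" prefix marking suffix morphemes, the toy add-one unigram model, the `lm_weight` parameter, and the sample data are all assumptions made for illustration only.

```python
import math
from collections import Counter


def join_morphemes(tokens):
    """Glue morpheme tokens into words; a leading '+' marks a suffix (assumed notation)."""
    words = []
    for tok in tokens:
        if tok.startswith("+") and words:
            words[-1] += tok[1:]
        else:
            words.append(tok.lstrip("+"))
    return words


class UnigramWordLM:
    """Toy add-one-smoothed unigram model, standing in for a real word-based LM."""

    def __init__(self, corpus_sentences):
        self.counts = Counter(w for s in corpus_sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def logprob(self, words):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[w] + 1) / denom) for w in words)


def rerank(nbest, lm, lm_weight=1.0):
    """Sort (morpheme_tokens, decoder_score) pairs by decoder score plus weighted word-LM score."""
    def combined(item):
        tokens, decoder_score = item
        return decoder_score + lm_weight * lm.logprob(join_morphemes(tokens))
    return sorted(nbest, key=combined, reverse=True)


# Hypothetical word-level corpus and n-best list (log-domain decoder scores).
lm = UnigramWordLM(["evlerde oturduk", "evlerde bekledik"])
nbest = [
    (["ev", "+de", "+ler", "otur", "+du", "+k"], -1.9),  # ill-formed suffix order
    (["ev", "+ler", "+de", "otur", "+du", "+k"], -2.0),  # well-formed "evlerde oturduk"
]
best_tokens, _ = rerank(nbest, lm)[0]
print(" ".join(join_morphemes(best_tokens)))
```

In this toy setup the word-level model prefers the hypothesis whose joined words occur in its corpus, even though the decoder scored the other hypothesis slightly higher. A real system would presumably use a higher-order n-gram model and a tuned weight.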
