Lexical features for statistical machine translation.

Abstract

In modern phrasal and hierarchical statistical machine translation systems, two major features model translation: rule translation probabilities and lexical smoothing scores. The rule translation probabilities are computed as maximum likelihood estimates (MLEs) of an entire source (or target) phrase translating to a target (or source) phrase. The lexical smoothing scores are also a likelihood estimate of a source (target) phrase translating to a target (source) phrase, but they are computed using independent word-to-word translation probabilities. Intuitively, it would seem that the lexical smoothing score is a less powerful estimate of translation likelihood due to this independence assumption, but I present the somewhat surprising result that lexical smoothing is far more important to the quality of a state-of-the-art hierarchical SMT system than rule translation probabilities. I posit that this is due to a fundamental data sparsity problem: the average word-to-word translation is seen many more times than the average phrase-to-phrase translation, so the word-to-word translation probabilities (or lexical probabilities) are far better estimated.

Motivated by this result, I present a number of novel methods for modifying the lexical probabilities to improve the quality of our MT output. First, I examine two methods of lexical probability biasing, where for each test document, a set of secondary lexical probabilities is extracted and interpolated with the primary lexical probability distribution. Biasing each document with the probabilities extracted from its own first-pass decoding output provides a small but consistent gain of about 0.4 BLEU.

Second, I contextualize the lexical probabilities by factoring in additional information such as the previous or next word. The key to the success of this context-dependent lexical smoothing is a backoff model, where our "trust" of a context-dependent probability estimation is directly proportional to how many times it was seen in training. In this way, I avoid the estimation problem seen in translation rules, where the amount of context is high but the probability estimation is inaccurate. When using the surrounding words as context, this feature provides a gain of about 0.6 BLEU on Arabic and Chinese.

Finally, I describe several types of discriminatively trained lexical features, along with a new optimization procedure called Expected-BLEU optimization. This new optimization procedure is able to robustly estimate weights for thousands of decoding features, which can in effect discriminatively optimize a set of lexical probabilities to maximize BLEU. I also describe two other discriminative feature types, one of which is the part-of-speech analogue to lexical probabilities, and the other of which estimates training corpus weights based on lexical translations. The discriminative features produce a gain of 0.8 BLEU on Arabic and 0.4 BLEU on Chinese.
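
The abstract gives no formulas, so the following Python sketch is only an illustration of three ideas it names: a lexical smoothing score built from independent word-to-word probabilities, document biasing by interpolation, and a count-based backoff. Everything specific here is an assumption rather than the thesis's method: the Model-1-style averaging over source words, the linear interpolation, the weighting `count / (count + k)`, the function and parameter names, and the toy probability table.

```python
from math import log

def lexical_score(src_words, tgt_words, p_lex, floor=1e-9):
    """Log lexical score of a target phrase given a source phrase.

    Assumption: each target word is generated independently, and its
    probability is the average of p(t | s) over all source words.
    """
    total = 0.0
    for t in tgt_words:
        avg = sum(p_lex.get((s, t), 0.0) for s in src_words) / len(src_words)
        total += log(max(avg, floor))  # floor avoids log(0) for unseen pairs
    return total

def interpolate(p_primary, p_secondary, weight):
    """Linearly mix a secondary (document-specific) table into the primary one."""
    keys = set(p_primary) | set(p_secondary)
    return {
        k: (1.0 - weight) * p_primary.get(k, 0.0) + weight * p_secondary.get(k, 0.0)
        for k in keys
    }

def backoff(p_context, p_base, count, k=5.0):
    """Trust the context-dependent estimate more the more often it was seen.

    Assumption: trust = count / (count + k); k is a hypothetical constant.
    """
    trust = count / (count + k)
    return trust * p_context + (1.0 - trust) * p_base

# Toy table of p(target word | source word); the values are made up.
p_lex = {
    ("la", "the"): 0.7,
    ("maison", "the"): 0.1,
    ("la", "house"): 0.05,
    ("maison", "house"): 0.8,
}

score = lexical_score(["la", "maison"], ["the", "house"], p_lex)
biased = interpolate(p_lex, {("maison", "house"): 1.0}, weight=0.2)
unseen = backoff(p_context=0.9, p_base=0.5, count=0)
seen_often = backoff(p_context=0.9, p_base=0.5, count=95)
```

With a count of zero the backoff returns the context-independent estimate unchanged, and as the count grows the result moves toward the context-dependent estimate. This mirrors the abstract's point that sparsely observed contexts should not be trusted.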

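Bibliographic record

  • Author

    Devlin, Jacob

  • Author affiliation

    University of Maryland, College Park

  • Degree-granting institution: University of Maryland, College Park
  • Subject: Computer Science
  • Degree: M.S.
  • Year: 2009
  • Pages: 91 p.
  • Total pages: 91
  • Original format: PDF
  • Language of text: eng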
