Journal: Machine Translation

Labeling hierarchical phrase-based models without linguistic resources

Abstract

Long-range word order differences are a well-known problem for machine translation. Unlike standard phrase-based models, which work with sequential and local phrase reordering, the hierarchical phrase-based model (Hiero) embeds the reordering of phrases within pairs of lexicalized context-free rules. This allows the model to handle long-range reordering recursively. However, the Hiero grammar works with a single nonterminal label, which means that rules are combined into derivations independently and without reference to context outside the rules themselves. Follow-up work explored remedies involving nonterminal labels obtained from monolingual parsers and taggers. As yet, no labeling mechanisms exist for the many languages that lack good-quality parsers or taggers. In this paper we contribute a novel approach for acquiring reordering labels for Hiero grammars directly from the word-aligned parallel training corpus, without the use of any taggers or parsers. The new labels represent types of alignment patterns in which a phrase pair is embedded within larger phrase pairs. To obtain alignment patterns that generalize well, we propose decomposing word alignments into trees over phrase pairs. Besides this labeling approach, we contribute coarse and sparse features for learning soft, weighted label substitution, as opposed to standard substitution. We report extensive experiments comparing our model to two baselines: Hiero and the well-known syntax-augmented machine translation (SAMT) variant, which labels Hiero rules with nonterminals extracted from monolingual syntactic parses. We also test a simplified labeling scheme based on inversion transduction grammar (ITG). For the Chinese-English task we obtain improvements of up to 1 BLEU point, whereas for the German-English task, where morphology is an issue, we report a minor but statistically significant improvement of 0.2 BLEU points over SAMT. While ITG labeling does improve performance, it is sometimes suboptimal relative to our proposed labeling scheme.
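The idea of soft, weighted label substitution can be illustrated with a minimal sketch. It is not the paper's implementation: the label names (`MONO`, `INV`), the feature names, and the weights are all hypothetical. The sketch also assumes that standard substitution requires the label of a nonterminal gap to equal the left-hand-side label of the rule filling it. The soft variant allows any substitution and instead fires one coarse feature (match or no match) and one sparse feature per label pair, leaving the weights to be learned.

```python
from collections import Counter


def substitution_features(gap_label, rule_label):
    """Features for substituting a rule labeled `rule_label` into a gap
    labeled `gap_label`. All feature names are illustrative assumptions."""
    feats = Counter()
    # Coarse feature: do the two labels agree or not?
    feats["match" if gap_label == rule_label else "nomatch"] += 1
    # Sparse feature: one indicator per specific (gap, rule) label pair.
    feats[f"subst:{gap_label}<-{rule_label}"] += 1
    return feats


def score(feats, weights):
    """Linear model score: dot product of feature counts and weights."""
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())


# Hypothetical weights; in a real system a tuning procedure would set them.
weights = {"match": 0.5, "nomatch": -0.3, "subst:MONO<-INV": 0.1}
feats = substitution_features("MONO", "INV")
penalty = score(feats, weights)
```

With these made-up weights a mismatched substitution is penalized rather than forbidden, which is the contrast with strict label matching.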