Hybrid data-driven models of machine translation

Abstract

Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correspondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred.

In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOH phrase-based decoder (Koehn, 2004a) and employing standard phrasal extraction heuristics (Koehn et al., 2003). In addition, we describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived.

Making use of training and testing data taken from a French-English translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrase-based data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSMT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.

On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments making use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Microsystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid data-driven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality.

Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the state-of-the-art PHARAOH SMT decoder (Koehn, 2004a).

Bibliographic details

  • Author

    Groves, Declan

  • Author affiliation
  • Year 2007
  • Total pages
  • Original format PDF
  • Text language en
  • Chinese Library Classification
