Journal: Natural Language Engineering

Segmentation and alignment of parallel text for statistical machine translation

Abstract

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model, which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure, which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units, which are allowed to be reordered to improve the chunk alignment. The quality of the chunk pairs is measured by the performance of machine translation systems trained on them. We show practical benefits of divisive clustering, as well as how system performance can be improved by exploiting portions of the parallel text that would otherwise have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.
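As an illustration of the first-pass idea only, the following is a minimal sketch of monotone dynamic programming alignment over sentence lists. It is not the authors' model: the bead inventory `BEADS`, its penalty values, and the character-length cost in `bead_cost` are all hypothetical choices made for this example, standing in for the probabilistic alignment model the abstract refers to.

```python
from typing import List, Tuple

# Hypothetical bead types (source count, target count) with made-up penalties.
BEADS = {(1, 1): 0.0, (1, 0): 4.0, (0, 1): 4.0, (2, 1): 1.0, (1, 2): 1.0}

Span = Tuple[int, int]  # half-open index range [start, end)


def bead_cost(src: List[str], tgt: List[str], penalty: float) -> float:
    """Toy cost: scaled relative character-length mismatch plus a bead penalty."""
    src_len = sum(len(s) for s in src)
    tgt_len = sum(len(t) for t in tgt)
    mismatch = abs(src_len - tgt_len) / (1.0 + max(src_len, tgt_len))
    return 10.0 * mismatch + penalty


def align(src: List[str], tgt: List[str]) -> List[Tuple[Span, Span]]:
    """Return the lowest-cost monotone alignment as (source span, target span) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable cell by each allowed bead type.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(src[i:ni], tgt[j:nj], penalty)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace back from the end of both texts to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append(((pi, i), (pj, j)))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["aaaa", "bb", "cc", "dddddd"]
    target = ["aaaa", "bbcc", "dddddd"]
    for s_span, t_span in align(source, target):
        print(source[slice(*s_span)], "<->", target[slice(*t_span)])
```

A sketch like this only produces a coarse, order-preserving alignment. The divisive clustering refinement described in the abstract, which splits text into sub-sentence units and allows reordering, is not covered here.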