Journal: Computational Linguistics

Improving machine translation performance by exploiting non-parallel corpora


Abstract

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available.
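
The classification step described in the abstract can be illustrated with a minimal sketch. For two classes, a maximum entropy classifier is equivalent to logistic regression, so the example below trains one with plain stochastic gradient ascent. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy French-English lexicon `LEXICON`, the three features (length ratio and lexicon coverage in each direction), the training pairs in `TRAIN`, and the function names.

```python
import math

# Hypothetical toy lexicon (source word -> possible target words).
# A real system would use a dictionary learned from the seed parallel corpus.
LEXICON = {
    "le": {"the"}, "la": {"the"}, "est": {"is"},
    "chat": {"cat"}, "noir": {"black"},
    "maison": {"house", "home"},
    "grand": {"big", "large"}, "grande": {"big", "large"},
}

def features(src, tgt):
    """Return [bias, length ratio, source coverage, target coverage].

    These features are illustrative assumptions, not the paper's feature set.
    """
    s, t = src.lower().split(), tgt.lower().split()
    t_set = set(t)
    # All target words that some source word could translate to.
    translations = set().union(*(LEXICON.get(w, set()) for w in s))
    covered_src = sum(1 for w in s if LEXICON.get(w, set()) & t_set)
    covered_tgt = sum(1 for w in t if w in translations)
    length_ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    return [1.0, length_ratio,
            covered_src / max(len(s), 1),
            covered_tgt / max(len(t), 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, src, tgt):
    """Model probability that (src, tgt) is a translation pair."""
    x = features(src, tgt)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(examples, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * 4
    for _ in range(epochs):
        for (src, tgt), label in examples:
            x = features(src, tgt)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            weights = [w + lr * (label - p) * xi for w, xi in zip(weights, x)]
    return weights

def is_parallel(weights, src, tgt, threshold=0.5):
    return score(weights, src, tgt) >= threshold

# Hypothetical labeled pairs: 1 = translations of each other, 0 = not.
TRAIN = [
    (("le chat est noir", "the cat is black"), 1),
    (("la maison est grande", "the house is big"), 1),
    (("le chat est noir", "stock markets fell sharply on monday"), 0),
    (("la maison est grande", "he said"), 0),
]

weights = train(TRAIN)
print(is_parallel(weights, "le chat est grand", "the cat is big"))
print(is_parallel(weights, "la maison est grande",
                  "prices rose again sharply this week"))
```

In a setting like the one the abstract describes, such a classifier would be applied to candidate sentence pairs drawn from comparable articles, and the pairs it accepts would be added to the MT training data. The threshold of 0.5 is likewise an arbitrary choice for this sketch.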
