Conference: International Conference on Intelligent Text Processing and Computational Linguistics

Mining Parallel Resources for Machine Translation from Comparable Corpora

Abstract

Good performance in Statistical Machine Translation (SMT) usually requires very large parallel bilingual training corpora, because the translations of words and phrases are computed from bilingual data. For low-resource language pairs such as English-Bengali, however, performance suffers from an insufficient amount of bilingual training data. Recently, comparable corpora have come to be widely regarded as a valuable resource for machine translation. Although very few cases of sub-sentential parallelism are found between two comparable documents, comparable corpora still contain potentially parallel phrases. Mining parallel data from comparable corpora is therefore a promising way to collect more parallel training data for SMT. In this paper, we propose a method for automatically aligning English-Bengali comparable sentences drawn from comparable documents. We measure text similarity using a novel textual entailment method and distributional semantics. Subsequently, we apply a template-based phrase extraction technique to align parallel phrases from the comparable sentence pairs. We demonstrate the effectiveness of our approach by using the extracted parallel phrases as additional training examples for an English-Bengali phrase-based SMT system. Our system achieves a significant improvement in translation quality over the baseline system.
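
The sentence alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes each sentence has already been mapped to a vector in a shared cross-lingual space (how those vectors are produced is outside the sketch), it uses plain cosine similarity in place of the textual entailment and distributional semantics measures, and the names `cosine`, `align_sentences`, and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Assumption: both lists hold vectors from the same (cross-lingual) space.
    Returns (src_index, tgt_index, score) for every source sentence whose
    best match scores at least `threshold`; other sentences are left unaligned.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        best_j, best_score = -1, threshold
        for j, v in enumerate(tgt_vecs):
            score = cosine(u, v)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j >= 0:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy vectors standing in for English (src) and Bengali (tgt) sentences.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
tgt = [[0.0, 1.0, 1.0], [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]

for i, j, score in align_sentences(src, tgt):
    print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

The threshold trades precision for recall: a higher value keeps only near-parallel sentence pairs, while a lower one admits looser pairs from which a later phrase extraction step would have to filter noise.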