Journal: Machine Translation

The impact of Arabic morphological segmentation on broad-coverage English-to-Arabic statistical machine translation


Abstract

Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes, ranging from full word forms to fully segmented forms, and examine their effects on system performance. Our results show a difference of 2.31 BLEU points, averaged over all test sets, between the best and worst segmentation schemes, indicating that the choice of segmentation scheme has a significant effect on the performance of an English-to-Arabic PBSMT system in a large-data scenario. We show that a simple segmentation scheme can perform as well as the best, more complicated segmentation scheme. An in-depth analysis of the effect of segmentation choices on the components of a PBSMT system reveals that text fragmentation has a negative effect on the perplexity of the language models, and that aggressive segmentation can significantly increase both the size of the phrase table and the uncertainty in choosing candidate translation phrases during decoding. An investigation of the output of the different systems reveals the complementary nature of their outputs and the great potential in combining them.
