Journal: ACM Transactions on Asian Language Information Processing

Pause-Based Phrase Extraction and Effective OOV Handling for Low-Resource Machine Translation Systems

Abstract

Machine translation is a core problem in natural language processing research across the globe. However, building a translation system for low-resource languages remains a challenge for statistical machine translation (SMT). This work proposes a phrase-induced hybrid machine translation system for English-to-Tamil translation in a low-resource setting and studies its effect. Unlike conventional hybrid MT systems, the proposed system exploits the free word order of the target language, Tamil, to form a re-ordered target language model and to extend the parallel text corpus used to train the SMT system. A novel rule-based phrase-extraction method, which uses part-of-speech (POS) tags and the place of pause in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT system. Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques, based on the POS tag of the OOV word. To ensure that input containing OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation approach is adopted: it finds the closest matching phrase and translates it, followed by OOV replacement. The proposed system achieves a bilingual evaluation understudy (BLEU) score of 84.78 and a translation edit rate (TER) of 19.12. The system is also compared, in terms of adequacy and fluency, with existing translation systems for this language pair, and it is observed to outperform its counterparts.