Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Hai-Long Trieu; Duc-Vu Tran; Ittoo Ashwin; Le-Minh Nguyen




Abstract

Phrase-based machine translation (MT) systems require large bilingual corpora for training. Such corpora are unavailable for most of the world's language pairs, which creates a bottleneck for the development of MT. The Asian language pairs considered here (Japanese, Indonesian, and Malay, each paired with Vietnamese) are no exception: no large bilingual corpora exist for them. Furthermore, although these languages are widely used, there is no prior work on MT for these pairs, which further hinders development. In this article, we conducted an empirical study of leveraging additional resources to improve MT for these Asian low-resource language pairs: translation from Japanese, Indonesian, and Malay to Vietnamese. We propose an approach based on two strategies: building bilingual corpora from comparable data, and phrase pivot translation using existing bilingual corpora that pair each language with English. Bilingual corpora were built from Wikipedia bilingual titles to augment the bilingual data available for the low-resource languages. Additionally, we introduced a model that combines these additional resources to improve MT on the Asian low-resource languages. Experimental results show that our systems improve translation quality by +2 to +7 BLEU points. This work contributes to the development of MT on low-resource languages and opens a promising direction for MT on these Asian language pairs.
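The abstract names phrase pivot translation but does not give its formulation. A common form, often called triangulation, estimates the source-to-target phrase probability by summing over shared English pivot phrases: p(t|s) = Σ_e p(e|s) · p(t|e). The sketch below illustrates that general idea only; it is not the authors' implementation. The function name `pivot_phrase_table`, the nested-dictionary table layout, and every phrase and probability in the toy tables are assumptions made for illustration.

```python
from collections import defaultdict


def pivot_phrase_table(src_to_pivot, pivot_to_tgt):
    """Triangulate two phrase tables through a pivot language.

    Both inputs are hypothetical nested dicts of the form
    {phrase: {translation: probability}}. The result estimates
    p(tgt | src) = sum over pivot phrases e of p(e | src) * p(tgt | e).
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in table.items()}


# Toy, made-up tables: Japanese -> English and English -> Vietnamese.
ja_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_vi = {"cat": {"con mèo": 0.8, "mèo": 0.2}, "kitty": {"mèo con": 1.0}}

ja_vi = pivot_phrase_table(ja_en, en_vi)
for target, prob in sorted(ja_vi["猫"].items(), key=lambda kv: -kv[1]):
    print(f"猫 -> {target}: {prob:.2f}")
```

In a real system the two input tables would come from phrase extraction on the source-English and English-target corpora, and the induced table would typically be combined with any directly trained table, in the spirit of the combined model the abstract mentions.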


Bibliographic record

Source
ACM transactions on Asian language information processing | 2019, Issue 3 | 32.1-32.22 | 22 pages
Authors
Hai-Long Trieu; Duc-Vu Tran; Ittoo Ashwin; Le-Minh Nguyen
Author affiliations

Japan Adv Inst Sci & Technol Sch Informat Sci Asahidai 1-1 Nomi Ishikawa Japan;

Univ Liege QUANTOM Ctr Quantitat Methods & Operat Management HEC Liege Rue Louvrex 14 B-4000 Liege Belgium;

Indexing information
Original format: PDF
Text language: English
Keywords
Statistical machine translation; pivot methods; sentence alignment; semantic similarity; low-resource languages;


Similar literature

1. Extremely low-resource neural machine translation for Asian languages [J]. Rubino Raphael, Marie Benjamin, Dabre Raj. Machine translation. 2020, Issue 4
2. Neighbors helping the poor: improving low-resource machine translation using related languages [J]. Nima Pourdamghani, Kevin Knight. Machine translation. 2019, Issue 3
3. Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages [J]. Saurav Jha, Akhilesh Sudhakar, Anil Kumar Singh. Journal of Language Modelling. 2019, Issue 2
4. Neural-Based Machine Translation System Outperforming Statistical Phrase-Based Machine Translation for Low-Resource Languages [C]. Muskaan Singh, Ravinder Kumar, Inderveer Chana. International Conference on Contemporary Computing. 2019
5. Leveraging Training Data from High-Resource Languages to Improve Dependency Parsing for Low-Resource Languages [D]. Jaja, Claire. 2014
6. Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation [O]. Michael Adjeisah, Guohua Liu, Douglas Omwenga Nyabuga. 2021
7. Learning cross-lingual phonological and orthagraphic adaptations: a case study in improving neural machine translation between low-resource languages [O]. Saurav Jha, Akhilesh Sudhakar, Anil Kumar Singh. 2019
8. Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages [R]. Carbonell, J., Levin, L., Smith, N. 2017
