Conference: International Joint Conference on Natural Language Processing

The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

Abstract

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali-English and Sinhala-English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.
