Journal: Computer Science Journal of Moldova

Wiki-Translator: Multilingual Experiments for In-Domain Translations


Abstract

Various researchers have already shown that comparable corpora can improve translation quality in statistical machine translation (SMT). The usual approach starts with a baseline system trained on out-of-domain parallel corpora and then adapts it to the domain in which new translations are needed. Adaptation to a new domain, especially a narrow one, relies on data extracted from comparable corpora in that domain or in one as close to it as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. The approach is feasible if the comparable corpora are large enough to yield SMT-useful data in sufficient quantities for reliable training; the more comparable corpora are available, the better the results. Wikipedia is a very good candidate for such an experiment. We report on large-scale experiments showing significant improvements over a baseline system built from highly similar (almost parallel) text fragments extracted from Wikipedia. The improvements, which are statistically significant, are related to what we call the level of translational similarity between extracted pairs of sentences. The experiments were performed for three language pairs, Spanish-English, German-English and Romanian-English, using sentence pairs extracted from the entire Wikipedia dumps as of December 2012. Our experiments and a comparison with similar work show that indiscriminately adding more data to a training corpus is not necessarily a good thing in SMT.
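The abstract ties translation quality to the level of translational similarity between extracted sentence pairs. As a rough illustration only, the sketch below shows how a similarity threshold could trade training-set size against pair quality. The `SentencePair` type, the `select_training_pairs` helper, the scores and the thresholds are all hypothetical; the abstract does not say how the authors compute or apply their similarity measure.

```python
from typing import List, NamedTuple


class SentencePair(NamedTuple):
    source: str
    target: str
    # Hypothetical similarity score in [0, 1]; higher means closer to parallel.
    similarity: float


def select_training_pairs(pairs: List[SentencePair], threshold: float) -> List[SentencePair]:
    """Keep only the pairs whose similarity score is at least `threshold`."""
    return [p for p in pairs if p.similarity >= threshold]


# Invented example data, not taken from the paper.
candidates = [
    SentencePair("Bucarest es la capital de Rumanía.", "Bucharest is the capital of Romania.", 0.9),
    SentencePair("La ciudad tiene muchos parques.", "The city has a large university.", 0.4),
    SentencePair("El río cruza la ciudad.", "The river crosses the city.", 0.8),
]

# A lower threshold admits more, but noisier, training data.
for threshold in (0.3, 0.5, 0.85):
    kept = select_training_pairs(candidates, threshold)
    print(f"threshold={threshold}: {len(kept)} pairs kept")
```

Under this reading, choosing the threshold is the point at which "more data" and "better data" pull in opposite directions, which is the trade-off the closing sentence of the abstract refers to.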
