
CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

Abstract

We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 Common Crawl snapshots (Wenzek et al., 2019), totalling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high-quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT'19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single ones by over 4 BLEU points and are on par with the best WMT'19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.
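The abstract does not spell out how candidate pairs are scored. As a rough illustration only, the sketch below assumes the ratio-margin criterion commonly used in margin-based mining: the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbours in the other language. The function names, the toy three-dimensional "embeddings", `k=2`, and the threshold of 1.0 are all hypothetical; a real system would use a multilingual sentence encoder and approximate nearest-neighbour search rather than the brute-force comparison shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ratio_margin_scores(src, tgt, k=2):
    """Ratio-margin score for every (source, target) pair (brute force)."""
    sim = [[cosine(x, y) for y in tgt] for x in src]
    # Average similarity of each source sentence to its k nearest targets.
    src_avg = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    # Average similarity of each target sentence to its k nearest sources.
    tgt_avg = [sum(sorted(col, reverse=True)[:k]) / k for col in zip(*sim)]
    return [
        [sim[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]

def mine_pairs(src, tgt, k=2, threshold=1.0):
    """Keep the best-scoring target per source if it clears the threshold."""
    scores = ratio_margin_scores(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs

# Toy embeddings (hypothetical): each source vector has one close target.
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.1, 0.9]]
tgt_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

pairs = mine_pairs(src_vecs, tgt_vecs)
for i, j, score in pairs:
    print(f"source {i} <-> target {j}: margin score {score:.2f}")
```

Normalising by the neighbourhood average is what makes a single threshold usable across sentences: a pair is kept when it stands out from its nearest alternatives, not merely because its raw cosine similarity is high.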
