Conference on Machine Translation

NICT's Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task

Abstract

This paper presents NICT's participation in the WMT18 shared task on parallel corpus filtering. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled subsets of 100 million and 10 million words and built the corresponding NMT systems. Empirical results show that our NMT systems trained on the sampled data achieve promising performance.
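The final sampling step can be illustrated with a minimal sketch. It assumes that each sentence pair already carries a classifier score, that pairs are taken in descending score order, and that the word budget is counted on whitespace-separated English tokens. None of these details is stated in the abstract, and the function name `select_by_word_budget` is hypothetical.

```python
from typing import Iterable, List, Tuple

# (classifier score, German sentence, English sentence); this layout is an assumption.
ScoredPair = Tuple[float, str, str]


def select_by_word_budget(pairs: Iterable[ScoredPair], word_budget: int) -> List[ScoredPair]:
    """Keep the highest-scored pairs until the English-side word budget is used up."""
    selected: List[ScoredPair] = []
    words_used = 0
    # Visit pairs from the highest score to the lowest.
    for score, de, en in sorted(pairs, key=lambda p: p[0], reverse=True):
        n_words = len(en.split())  # assumption: count whitespace tokens on the English side
        if words_used + n_words > word_budget:
            break  # stop at the first pair that would exceed the budget
        selected.append((score, de, en))
        words_used += n_words
    return selected


if __name__ == "__main__":
    toy = [
        (0.9, "Guten Morgen", "Good morning"),
        (0.1, "xx", "buy cheap now online"),
        (0.7, "Wie geht es dir", "How are you"),
    ]
    for pair in select_by_word_budget(toy, word_budget=5):
        print(pair)
```

Under these assumptions, the 100-million-word and 10-million-word subsets would correspond to calling the function with `word_budget` set to 100_000_000 and 10_000_000 respectively.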
