Conference on Machine Translation

Measuring sentence parallelism using Mahalanobis distances: The NRC unsupervised submissions to the WMT18 Parallel Corpus Filtering shared task

Abstract

The WMT18 shared task on parallel corpus filtering (Koehn et al., 2018b) challenged teams to score sentence pairs from a large, high-recall, low-precision web-scraped parallel corpus (Koehn et al., 2018a). Participants could use existing sample corpora (e.g., past WMT data) as a supervisory signal to learn what a "clean" corpus looks like. However, in lower-resource situations it often happens that the target corpus of the language is the only sample of parallel text in that language. We therefore made several unsupervised entries, setting ourselves the additional constraint that we not use the additional clean parallel corpora. One such entry fairly consistently scored in the top ten systems in the 100M-word conditions and, for one task (translating the European Medicines Agency corpus; Tiedemann, 2009), scored among the best systems even in the 10M-word conditions.
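
For readers unfamiliar with the distance named in the title, the following is a minimal, dependency-free sketch of scoring sentence pairs with a Mahalanobis distance whose statistics are estimated from the noisy corpus itself, with no clean reference data. It is an illustration under stated assumptions and not the NRC systems' actual scoring function, which the abstract does not describe: it assumes each sentence already has a fixed-length vector in a space shared by both languages, and the names `score_pairs`, `covariance`, and the `ridge` parameter are hypothetical.

```python
def covariance(rows, ridge=1e-6):
    """Return (mean, sample covariance) of the row vectors.

    `ridge` is an assumed small diagonal term that keeps the matrix invertible.
    """
    n, d = len(rows), len(rows[0])
    mu = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return mu, cov


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(v, mu, cov):
    """sqrt((v - mu)^T cov^{-1} (v - mu)), computed without forming cov^{-1}."""
    diff = [a - b for a, b in zip(v, mu)]
    z = solve(cov, diff)
    return sum(a * b for a, b in zip(diff, z)) ** 0.5


def score_pairs(src_vecs, tgt_vecs):
    """Score each pair by how atypical its source-minus-target vector is.

    Mean and covariance come from the (noisy) pairs being scored, so no
    clean parallel corpus is needed. Lower scores mean more typical pairs.
    """
    diffs = [[s - t for s, t in zip(sv, tv)]
             for sv, tv in zip(src_vecs, tgt_vecs)]
    mu, cov = covariance(diffs)
    return [mahalanobis(d, mu, cov) for d in diffs]
```

In this sketch a pair whose difference vector lies far from the bulk of the corpus, relative to the corpus's own spread, receives a high score and would be a candidate for filtering. How the sentence vectors are obtained, and how scores map to a keep-or-discard decision, are left open here.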