
Customizing Parallel Corpora at the Document Level

Abstract

Recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus. We propose a document-level approach to solving this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. In order to extract the best-matched documents from several parallel corpora, we propose ranking individual documents by using a length-normalized Okapi-based similarity score between them and the target corpus. This ranking allows us to discard 50-90% of the training data, while avoiding the performance drop caused by a good but mismatched resource, and even improving CLIR effectiveness by 4-7% when compared to using all available training data.
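The abstract's selection step — scoring each candidate document against the target corpus with a length-normalized Okapi similarity, then keeping only the top-ranked fraction of the training data — can be sketched as below. This is an illustrative sketch, not the authors' implementation: the exact normalization and parameter choices (`k1`, `b`, dividing by document length) are assumptions, and the treatment of the whole target corpus as one term-frequency profile is a simplification.

```python
import math
from collections import Counter

def bm25_score(doc_tokens, query_tf, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one document against a query term profile.

    df: document frequency of each term across the candidate pool.
    avg_len: average candidate-document length (for BM25's length component).
    """
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for term, qtf in query_tf.items():
        if term not in tf:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avg_len))
        score += idf * norm * qtf
    return score

def rank_documents(candidate_docs, target_corpus_tokens):
    """Rank candidate parallel-corpus documents by length-normalized
    BM25 similarity to the target corpus (treated as one long query)."""
    query_tf = Counter(target_corpus_tokens)
    n = len(candidate_docs)
    df = Counter()
    for d in candidate_docs:
        df.update(set(d))
    avg_len = sum(len(d) for d in candidate_docs) / n
    # Divide by document length so long documents are not favored outright.
    scored = [(bm25_score(d, query_tf, df, n, avg_len) / max(len(d), 1), i)
              for i, d in enumerate(candidate_docs)]
    scored.sort(reverse=True)
    return scored  # [(normalized score, doc index)], best match first
```

Discarding the bottom 50-90% of this ranking would then yield the custom-made training corpus the abstract describes.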
