Published in: Association for Computational Linguistics Annual Meeting

Customizing Parallel Corpora at the Document Level



Abstract

Recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus. We propose a document-level approach to solving this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. In order to extract the best-matched documents from several parallel corpora, we propose ranking individual documents by using a length-normalized Okapi-based similarity score between them and the target corpus. This ranking allows us to discard 50-90% of the training data, while avoiding the performance drop caused by a good but mismatched resource, and even improving CLIR effectiveness by 4-7% when compared to using all available training data.
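The abstract describes ranking individual training documents by a length-normalized Okapi-based similarity to the target corpus and discarding the low-ranked tail. The paper's exact scoring formula is not reproduced here; the sketch below is one plausible instantiation using standard Okapi BM25 term weighting, with the whole target corpus treated as a bag-of-words profile. The function name `rank_documents`, the parameter `keep_fraction`, and the constants `K1`, `B` are our assumptions, not the authors' notation.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common Okapi BM25 constants (assumed, not from the paper)

def rank_documents(candidates, target_docs, keep_fraction=0.5):
    """Rank candidate parallel-corpus documents (token lists) by a
    length-normalized Okapi-style similarity to the target corpus,
    returning only the top keep_fraction of them."""
    # Treat the entire target corpus as one bag of words ("query profile").
    target_terms = Counter(tok for doc in target_docs for tok in doc)

    # Document frequencies over the candidate pool, for idf.
    N = len(candidates)
    df = Counter()
    for doc in candidates:
        df.update(set(doc))
    avgdl = sum(len(d) for d in candidates) / N

    def score(doc):
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for term in target_terms:
            if term not in tf:
                continue
            # Smoothed Okapi idf, kept non-negative.
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (K1 + 1)
            den = tf[term] + K1 * (1 - B + B * dl / avgdl)
            s += idf * num / den
        # Extra division by document length: the "length-normalized"
        # score the abstract refers to.
        return s / dl

    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[: max(1, int(keep_fraction * N))]
```

With `keep_fraction` between 0.1 and 0.5 this mirrors the abstract's setting of discarding 50-90% of the training data; documents sharing vocabulary with the target (e.g. medical) corpus rank above mismatched ones.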


