Published in: Association for Computational Linguistics Annual Meeting

Customizing Parallel Corpora at the Document Level



Abstract

Recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus. We propose a document-level approach to solving this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. In order to extract the best-matched documents from several parallel corpora, we propose ranking individual documents by using a length-normalized Okapi-based similarity score between them and the target corpus. This ranking allows us to discard 50-90% of the training data, while avoiding the performance drop caused by a good but mismatched resource, and even improving CLIR effectiveness by 4-7% when compared to using all available training data.
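The abstract describes ranking individual training documents by a length-normalized Okapi-based similarity to the target corpus and discarding the low-ranked tail. The paper's exact scoring formula is not reproduced here; the sketch below is one plausible instantiation using standard Okapi BM25 term weighting, with the whole target corpus treated as a bag-of-words profile. The function name `rank_documents`, the parameter `keep_fraction`, and the constants `K1`, `B` are our assumptions, not the authors' notation.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # common Okapi BM25 constants (assumed, not from the paper)

def rank_documents(candidates, target_docs, keep_fraction=0.5):
    """Rank candidate parallel-corpus documents (token lists) by a
    length-normalized Okapi-style similarity to the target corpus,
    returning only the top keep_fraction of them."""
    # Treat the entire target corpus as one bag of words ("query profile").
    target_terms = Counter(tok for doc in target_docs for tok in doc)

    # Document frequencies over the candidate pool, for idf.
    N = len(candidates)
    df = Counter()
    for doc in candidates:
        df.update(set(doc))
    avgdl = sum(len(d) for d in candidates) / N

    def score(doc):
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for term in target_terms:
            if term not in tf:
                continue
            # Smoothed Okapi idf, kept non-negative.
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            num = tf[term] * (K1 + 1)
            den = tf[term] + K1 * (1 - B + B * dl / avgdl)
            s += idf * num / den
        # Extra division by document length: the "length-normalized"
        # score the abstract refers to.
        return s / dl

    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[: max(1, int(keep_fraction * N))]
```

With `keep_fraction` between 0.1 and 0.5 this mirrors the abstract's setting of discarding 50-90% of the training data; documents sharing vocabulary with the target (e.g. medical) corpus rank above mismatched ones.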


