
Customizing Parallel Corpora at the Document Level

Abstract

Recent research in cross-lingual information retrieval (CLIR) established the need for properly matching the parallel corpus used for query translation to the target corpus. We propose a document-level approach to solving this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. In order to extract the best-matched documents from several parallel corpora, we propose ranking individual documents by using a length-normalized Okapi-based similarity score between them and the target corpus. This ranking allows us to discard 50-90% of the training data, while avoiding the performance drop caused by a good but mismatched resource, and even improving CLIR effectiveness by 4-7% when compared to using all available training data.
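The abstract's selection step — scoring each candidate document against the target corpus with a length-normalized Okapi similarity, then keeping only the top-ranked fraction of the training data — can be sketched as below. This is an illustrative sketch, not the authors' implementation: the exact normalization and parameter choices (`k1`, `b`, dividing by document length) are assumptions, and the treatment of the whole target corpus as one term-frequency profile is a simplification.

```python
import math
from collections import Counter

def bm25_score(doc_tokens, query_tf, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one document against a query term profile.

    df: document frequency of each term across the candidate pool.
    avg_len: average candidate-document length (for BM25's length component).
    """
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for term, qtf in query_tf.items():
        if term not in tf:
            continue
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * dl / avg_len))
        score += idf * norm * qtf
    return score

def rank_documents(candidate_docs, target_corpus_tokens):
    """Rank candidate parallel-corpus documents by length-normalized
    BM25 similarity to the target corpus (treated as one long query)."""
    query_tf = Counter(target_corpus_tokens)
    n = len(candidate_docs)
    df = Counter()
    for d in candidate_docs:
        df.update(set(d))
    avg_len = sum(len(d) for d in candidate_docs) / n
    # Divide by document length so long documents are not favored outright.
    scored = [(bm25_score(d, query_tf, df, n, avg_len) / max(len(d), 1), i)
              for i, d in enumerate(candidate_docs)]
    scored.sort(reverse=True)
    return scored  # [(normalized score, doc index)], best match first
```

Discarding the bottom 50-90% of this ranking would then yield the custom-made training corpus the abstract describes.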
