Published in: Advances in Information Retrieval (conference proceedings)

Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR

Abstract

Aligned corpora are frequently used resources in cross-language information retrieval (CLIR) systems. The three properties of translation corpora that most strongly affect the performance of a corpus-based CLIR system are (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. This paper studies and evaluates the effects of these factors. Topics from two different domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora, and the sizes of the corpora are also varied. The results show that, of the three properties, topical nearness is the most crucial, outweighing both of the others. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.
