
Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora


Abstract

In this paper, we study different applications of cross-language latent topic models trained on comparable corpora. The first focus lies on the task of cross-language information retrieval (CLIR). The Bilingual Latent Dirichlet Allocation model (BiLDA) allows us to create an interlingual, language-independent representation of both queries and documents. We construct several BiLDA-based document models for CLIR, where no additional translation resources are used. The second focus lies on methods for extracting translation candidates and semantically related words using only the per-topic word distributions of the cross-language latent topic model. As the main contribution, we combine the two former steps, blending the evidence from the per-document topic distributions and the per-topic word distributions of the topic model with the knowledge from the extracted lexicon. We design and evaluate a novel evidence-rich statistical model for CLIR, and show that such a model, which combines various (internal-only) sources of evidence, obtains the best scores in experiments performed on the standard test collections of the CLEF 2001-2003 campaigns. We confirm these findings in an alternative evaluation, where we automatically generate queries and perform known-item search on a test subset of Wikipedia articles. The main importance of this work lies in the fact that we train translation resources from comparable document-aligned corpora and provide novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without using any additional external resources such as parallel corpora or machine-readable dictionaries.
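To make the two components described in the abstract concrete, below is a minimal NumPy sketch of (i) a BiLDA-style CLIR document model that scores a source-language query against target-language documents through the shared topic space, and (ii) a simple ranking of translation candidates obtained from the per-topic word distributions alone. All array names, shapes, and the "Cue"-style candidate scoring are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical trained BiLDA parameters (shapes and names are assumptions):
#   phi_src[k, w] : per-topic word distribution P(w | z_k) in the query (source) language
#   phi_tgt[k, w] : per-topic word distribution P(w | z_k) in the document (target) language
#   theta[d, k]   : per-document topic distribution P(z_k | d) for target-language documents
rng = np.random.default_rng(0)
K, V_src, V_tgt, D = 20, 500, 600, 100
phi_src = rng.dirichlet(np.ones(V_src), size=K)   # (K, V_src)
phi_tgt = rng.dirichlet(np.ones(V_tgt), size=K)   # (K, V_tgt)
theta = rng.dirichlet(np.ones(K), size=D)         # (D, K)


def clir_query_loglik(query_word_ids, theta, phi_src, mu=0.0, bg=None):
    """Log P(q | d) for every document d.

    Each query word is matched to documents through the shared topic space:
    P(w | d) = sum_k P(w | z_k, source) * P(z_k | d).
    An optional background unigram model bg (source language) can be mixed in
    with weight mu for smoothing.
    """
    p_w_given_d = theta @ phi_src[:, query_word_ids]          # (D, |q|)
    if bg is not None and mu > 0.0:
        p_w_given_d = (1.0 - mu) * p_w_given_d + mu * bg[query_word_ids]
    return np.log(p_w_given_d + 1e-12).sum(axis=1)            # (D,)


def translation_candidates(src_word_id, phi_src, phi_tgt, top_n=5):
    """Rank target-language words as translation candidates for one source word,
    using only the per-topic word distributions.

    Scoring: sum_k P(w_tgt | z_k) * P(z_k | w_src), with P(z_k | w_src)
    proportional to P(w_src | z_k) under a uniform topic prior
    (a simple Cue-style heuristic, assumed here for illustration).
    """
    p_topic_given_src = phi_src[:, src_word_id]
    p_topic_given_src = p_topic_given_src / p_topic_given_src.sum()
    scores = phi_tgt.T @ p_topic_given_src                     # (V_tgt,)
    return np.argsort(scores)[::-1][:top_n]


# Usage: rank all documents for a three-word source-language query,
# then list translation candidates for the first query word.
query = np.array([3, 42, 7])
ranking = np.argsort(clir_query_loglik(query, theta, phi_src))[::-1]
print("top documents:", ranking[:5])
print("candidates for word 3:", translation_candidates(3, phi_src, phi_tgt))
```

The evidence-rich model described in the abstract would additionally blend such topic-based word probabilities with probabilities from the lexicon extracted via the candidate-ranking step; the sketch above only illustrates the two individual sources of evidence.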
