Journal: Information Retrieval

Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora



Abstract

In this paper, we study different applications of cross-language latent topic models trained on comparable corpora. The first focus lies on the task of cross-language information retrieval (CLIR). The bilingual Latent Dirichlet Allocation model (BiLDA) allows us to create an interlingual, language-independent representation of both queries and documents. We construct several BiLDA-based document models for CLIR, where no additional translation resources are used. The second focus lies on methods for extracting translation candidates and semantically related words using only the per-topic word distributions of the cross-language latent topic model. As the main contribution, we combine the two former steps, blending the evidence from the per-document topic distributions and the per-topic word distributions of the topic model with the knowledge from the extracted lexicon. We design and evaluate a novel evidence-rich statistical model for CLIR, and show that such a model, which combines various (only internal) evidence, obtains the best scores in experiments performed on the standard test collections of the CLEF 2001–2003 campaigns. We confirm these findings in an alternative evaluation, where we automatically generate queries and perform known-item search on a test subset of Wikipedia articles. The main importance of this work lies in the fact that we train translation resources from comparable document-aligned corpora and provide novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without the use of any additional external resources such as parallel corpora or machine-readable dictionaries.
机译:在本文中,我们研究了在可比语料库上训练的跨语言潜在主题模型的不同应用。第一个重点是跨语言信息检索(CLIR)的任务。双语潜在Dirichlet分配模型(BiLDA)使我们能够创建查询和文档的语言间,独立于语言的表示形式。我们为CLIR构建了几个基于BiLDA的文档模型,其中没有使用其他翻译资源。第二个重点是仅使用跨语言潜在主题模型的按主题单词分布提取翻译候选单词和语义相关单词的方法。作为主要贡献,我们结合了前两个步骤,将主题模型的按文档主题分布和按主题单词分布的证据与提取的词典中的知识混合在一起。我们设计和评估了CLIR的新颖的,证据丰富的统计模型,并证明了该模型结合了各种(仅内部的)证据,从而为在CLEF 2001-2003活动的标准测试集上进行的实验获得了最佳分数。我们在替代评估中确认了这些发现,在该评估中,我们将自动生成查询,并对维基百科文章的测试子集执行已知项搜索。这项工作的主要重要性在于以下事实:我们训练来自可比的文档对齐语料库的翻译资源,并提供新颖的CLIR统计模型,该模型详尽地利用尽可能多的跨语言线索来寻求更好的CLIR结果,而无需使用任何其他外部资源,例如并行语料库或机器可读词典。
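The two building blocks the abstract describes can be sketched compactly: scoring a target-language document against a source-language query through the shared topic space (P(q|d) = Π_w Σ_k P(w|z_k)·P(z_k|d)), and ranking translation candidates by comparing per-topic word distributions across languages. The following is a minimal illustrative sketch, not the paper's implementation: the tiny hand-made phi/theta tables, the example vocabulary, and the cosine similarity used for candidate ranking are all assumptions standing in for distributions a trained BiLDA model would produce.

```python
import math

K = 2  # number of shared latent topics (toy value)

# Assumed toy BiLDA outputs; in the paper these come from training on a
# document-aligned comparable corpus.
# phi_src[w][k]: P(w | z_k) for source-language (here: French) words
phi_src = {
    "voiture":  [0.60, 0.05],
    "conduire": [0.30, 0.10],
    "banque":   [0.10, 0.85],
}
# phi_tgt[w][k]: P(w | z_k) for target-language (here: English) words
phi_tgt = {
    "car":   [0.55, 0.05],
    "drive": [0.35, 0.05],
    "bank":  [0.10, 0.90],
}
# theta[d][k]: P(z_k | d) for target-language documents
theta = {
    "doc_cars":    [0.9, 0.1],
    "doc_finance": [0.2, 0.8],
}

def query_log_likelihood(query_words, doc):
    """BiLDA document model: log P(q|d) = sum_w log sum_k P(w|z_k) P(z_k|d).
    The query and the document never need to share a vocabulary; the
    topics z_k bridge the two languages."""
    return sum(
        math.log(sum(phi_src[w][k] * theta[doc][k] for k in range(K)))
        for w in query_words
    )

def translation_candidates(src_word, top_n=1):
    """Rank target words by cosine similarity of their per-topic
    distributions to the source word's (one possible similarity choice)."""
    v = phi_src[src_word]
    def cos(u):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return sorted(phi_tgt, key=lambda w: cos(phi_tgt[w]), reverse=True)[:top_n]

# A French query about cars prefers the car-heavy English document,
# and "banque" maps to "bank" through the shared topic space.
q = ["voiture", "conduire"]
print(query_log_likelihood(q, "doc_cars") > query_log_likelihood(q, "doc_finance"))  # True
print(translation_candidates("banque"))  # ['bank']
```

The evidence-rich model of the paper additionally blends the extracted lexicon entries back into the retrieval score; the sketch above only shows the two individual sources of evidence in isolation.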


