
Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings


Abstract

We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR, using benchmark CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by combining the WE-based approach with a unigram language model. We also report significant improvements of our WE-based framework in ad-hoc IR tasks over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA).
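The retrieval pipeline the abstract describes — a shared cross-lingual embedding space plus compositional construction of query and document vectors — can be illustrated with a minimal sketch. The toy vectors and vocabulary below are hypothetical stand-ins for embeddings that BWESG would actually induce from document-aligned comparable data, and simple additive (mean) composition is used as one representative compositional model:

```python
import numpy as np

# Toy shared cross-lingual embedding space. In the paper's setting these
# vectors would be induced by BWESG from document-aligned comparable data;
# here they are hand-crafted for illustration only.
emb = {
    # English
    "dog":     np.array([0.9, 0.1, 0.0]),
    "cat":     np.array([0.8, 0.3, 0.0]),
    "stock":   np.array([0.0, 0.1, 0.9]),
    # Dutch
    "hond":    np.array([0.9, 0.2, 0.0]),   # "dog"
    "kat":     np.array([0.7, 0.4, 0.1]),   # "cat"
    "aandeel": np.array([0.1, 0.0, 0.9]),   # "stock/share"
}

def compose(tokens):
    """Additive composition: represent a query or document as the mean
    of its word vectors (a simple compositional distributional model)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# CLIR in the shared space: an English query ranks Dutch documents
# directly, with no translation step.
docs = {"d1": ["hond", "kat"], "d2": ["aandeel"]}
query = ["dog", "cat"]
q = compose(query)
ranking = sorted(docs, key=lambda d: cosine(q, compose(docs[d])), reverse=True)
print(ranking)  # the animal document d1 outranks the finance document d2
```

Because both languages live in one vector space, the same code performs MoIR when query and documents share a language; the abstract's best-performing configuration additionally interpolates these embedding-based scores with a unigram language model, which this sketch omits.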
