Journal: ACM Transactions on Asian Language Information Processing

Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages


Abstract

We investigate the use of word embeddings for query translation to improve precision in cross-language information retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed in such a way that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings, learned from comparable corpora of Indic languages, for the task of CLIR.

We propose a clustering method based on the multilingual word vectors to group similar words across languages. For this we construct a graph with words from multiple languages as nodes and with edges connecting words that have similar vectors. We use the Louvain method for community detection to find communities in this graph. We show that choosing target-language words as query translations from the clusters or communities containing the query terms helps improve CLIR. We also find that better-quality query translations are obtained when words from more languages are used for the clustering, even when the additional languages are neither the source nor the target language. This is probably because having more similar words across multiple languages helps form well-defined, dense subclusters, which in turn yield more precise query translations.

In this article, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
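The pipeline described in the abstract (similarity graph over multilingual words, community detection, then picking target-language words from the query term's community) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vectors are made-up toy values, the similarity threshold of 0.95 is an arbitrary assumption, the function names are hypothetical, and plain connected components stand in for the Louvain method so that the example needs only the standard library.

```python
import math

# Toy multilingual vectors (made-up numbers, NOT taken from the paper).
# Keys are (language, word) pairs; words are romanized for readability.
VECTORS = {
    ("en", "water"): (0.90, 0.10),
    ("hi", "paani"): (0.88, 0.15),
    ("bn", "jol"): (0.85, 0.20),
    ("en", "fire"): (0.10, 0.95),
    ("hi", "aag"): (0.12, 0.90),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def build_graph(vectors, threshold=0.95):
    """Adjacency sets: an edge joins words whose cosine similarity >= threshold.

    The threshold value is an assumption chosen for this toy data.
    """
    graph = {node: set() for node in vectors}
    nodes = list(vectors)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def connected_components(graph):
    """Simplified stand-in for community detection (the paper uses Louvain)."""
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


def translate(query_term, target_lang, communities):
    """Return target-language words that share a community with query_term."""
    for community in communities:
        if query_term in community:
            return sorted(w for lang, w in community if lang == target_lang)
    return []


communities = connected_components(build_graph(VECTORS))
print(translate(("en", "water"), "hi", communities))  # ['paani']
```

On real embeddings the graph would be far denser, and a modularity-based method such as Louvain would be needed to split it into meaningful communities; connected components only work here because the toy data is cleanly separated. Note that the Bengali word sits in the same community as the English and Hindi words even when only a Hindi translation is requested, which mirrors the abstract's observation that additional languages can take part in the clustering.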
