Conference: International Conference on Smart Systems and Inventive Technology

An Effective Approach of Extracting Local Documents from the Distributed Representation of Text using Document Embedding and Latent Semantic Analysis

Abstract

Document retrieval is the process of extracting the documents relevant to a given query. The main problem is finding related documents based on the local representation of the query. For example, suppose you search for documents related to "bank". "Bank" has two representations: "bank of a river" (distributed representation) and "savings bank" (local representation). We need the documents related to "savings bank", which belong to the local representation of the query "bank". This paper proposes a new model that combines Latent Semantic Analysis (LSA) with document embeddings to find the relevant documents. This is an initial attempt to combine document embedding vectors with LSA. All neural embedding models learn a distributed representation of text and match results in the latent semantic space for a given query, but searching documents from the distributed representation loses the relevance of the query's local representation. We propose a novel information retrieval system that uses a doc2vec model to retrieve the top N similar documents, then ranks them by relevance using Latent Semantic Indexing to give the top K documents (those whose score is greater than a soft threshold), which are the local representation of the given query. These K documents can then be used to find the most similar ones. We show that this "dual" combination performs better than other traditional information retrieval algorithms or recently developed neural network models.
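
The two-stage retrieval described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hand-made two-dimensional vectors below merely stand in for doc2vec embeddings and LSI projections, and the document names, the helper functions `top_n` and `rerank`, and the threshold value are all hypothetical. The abstract does not specify how the soft threshold is chosen, so a fixed cutoff is assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n(query_vec, doc_vecs, n):
    """Stage 1: ids of the n documents closest to the query in the embedding space."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:n]

def rerank(query_vec, doc_vecs, candidates, threshold):
    """Stage 2: score candidates in the latent space and keep those above the threshold."""
    scored = [(d, cosine(query_vec, doc_vecs[d])) for d in candidates]
    kept = [(d, s) for d, s in scored if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for doc2vec embeddings (hypothetical values).
embed = {
    "savings_account": [0.9, 0.3],
    "bank_loan": [0.8, 0.4],
    "river_bank": [0.7, 0.6],
    "fishing_trip": [0.1, 0.9],
}
# Toy vectors standing in for LSI projections of the same documents.
latent = {
    "savings_account": [1.0, 0.1],
    "bank_loan": [0.9, 0.2],
    "river_bank": [0.2, 1.0],
    "fishing_trip": [0.0, 1.0],
}
query_embed = [0.8, 0.5]   # the query "bank" in the embedding space
query_latent = [1.0, 0.0]  # the query "bank" in the latent space (financial sense)

candidates = top_n(query_embed, embed, n=3)                         # top N by embedding
results = rerank(query_latent, latent, candidates, threshold=0.5)   # top K above cutoff
print([doc for doc, _ in results])
```

In this toy setup the first stage still admits `river_bank` because it lies close to the query in the embedding space, and the second stage drops it because its latent-space score falls below the cutoff. In practice the two vector sets would come from trained doc2vec and LSI models rather than hand-written values.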