An Effective Approach of Extracting Local Documents from the Distributed Representation of Text using Document Embedding and Latent Semantic Analysis

Abstract

Document retrieval is the process of extracting the documents relevant to a given query. The main problem is finding the related documents based on the local representation of the query. For example, a search for "bank" has two possible representations: "bank of a river" (distributed representation) or "Saving Bank" (local representation). If the user needs documents about "Saving Bank", only those belong to the local representation of the query "bank". This paper proposes a new model that combines Latent Semantic Analysis (LSA) with document embeddings to find the relevant documents. This is an initial attempt to combine document embedding vectors with LSA. All neural embedding models learn a distributed representation of text and match results in the latent semantic space for a given query, but searching documents through the distributed representation loses the relevance of the local representation of that query. We propose a novel information retrieval system that uses the doc2vec model to retrieve the top N similar documents, then ranks them for relevance with Latent Semantic Indexing to keep the top K documents (those whose score exceeds a soft threshold), which are the local representation of the given query. These K documents can then be used to find the most similar ones. We show that this "dual" combination performs better than other traditional information retrieval algorithms and recently developed neural network models.
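The two-stage idea in the abstract (retrieve top N by embedding similarity, then keep the top K whose LSA score clears a threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: stage one uses plain bag-of-words count vectors as a stand-in for doc2vec vectors, stage two computes LSA with a truncated SVD from NumPy, and the function names, the toy corpus, `rank=2`, and `threshold=0.5` are all hypothetical choices made for the example.

```python
import numpy as np

def tokenize(text):
    return text.lower().split()

def build_vocab(texts):
    # Map each distinct term to a row index (sorted for determinism).
    terms = sorted({tok for t in texts for tok in tokenize(t)})
    return {term: i for i, term in enumerate(terms)}

def count_vector(text, vocab):
    # Raw term counts; terms outside the vocabulary are ignored.
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def top_n_by_embedding(query, docs, embed, n):
    # Stage 1: rank every document by cosine similarity of dense vectors.
    # `embed` is a stand-in for a doc2vec inference step (assumption).
    q = embed(query)
    scores = [cosine(q, embed(d)) for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: -scores[i])
    return ranked[:n]

def lsa_rerank(query, docs, candidates, rank=2, threshold=0.5):
    # Stage 2: LSA over the candidate set only, via truncated SVD.
    cand_texts = [docs[i] for i in candidates]
    vocab = build_vocab(cand_texts)
    A = np.column_stack([count_vector(t, vocab) for t in cand_texts])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(s))
    Uk = U[:, :k]
    doc_latent = Uk.T @ A                      # k x n_candidates
    q_latent = Uk.T @ count_vector(query, vocab)
    scored = [(doc_id, cosine(q_latent, doc_latent[:, j]))
              for j, doc_id in enumerate(candidates)]
    # Keep only documents whose latent-space score exceeds the threshold.
    kept = [(d, sc) for d, sc in scored if sc > threshold]
    return sorted(kept, key=lambda p: -p[1])

# Toy corpus (hypothetical) mixing both senses of "bank".
DOCS = [
    "saving bank account interest deposit",
    "bank loan interest account",
    "river bank water fishing",
    "fishing boat river water",
    "football match goal team",
]
corpus_vocab = build_vocab(DOCS)
embed = lambda text: count_vector(text, corpus_vocab)

candidates = top_n_by_embedding("saving bank account", DOCS, embed, n=3)
results = lsa_rerank("saving bank account", DOCS, candidates)
print(candidates)
print(results)
```

In this toy setup the river-bank document is retrieved in stage one because it shares the word "bank", and stage two is where it gets filtered out. In practice the `embed` callable would be replaced by a trained document embedding model, and the threshold would need tuning on real data.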

Bibliographic details

Source
International Conference on Smart Systems and Inventive Technology | 2019 | pp. 152-156 | 5 pages
Authors
Vikas Chib; Ahsan Jafri
Original format: PDF
Keywords
document handling; feature extraction; indexing; learning (artificial intelligence); natural language processing; neural nets; query processing; text analysis
