Journal: ACM Transactions on Information Systems

Concept-Based Information Retrieval Using Explicit Semantic Analysis


Abstract

Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
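
The core idea of mapping text into a space of explicit concepts can be illustrated with a minimal sketch. This is not the authors' implementation: the three-entry `CONCEPTS` dictionary is a hypothetical stand-in for a repository such as Wikipedia, the TF-IDF weighting and the function names (`build_inverted_index`, `concept_vector`, `cosine`) are assumptions made for illustration, and the feature selection step described in the abstract is omitted entirely.

```python
import math
from collections import Counter

# Hypothetical toy "knowledge repository": concept name -> article text.
# A real system would use a large collection such as Wikipedia articles.
CONCEPTS = {
    "Automobile": "car engine wheel road driver vehicle car",
    "Finance": "bank money loan interest investment bank",
    "River": "river water bank flow stream fish",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each term to {concept name: TF-IDF weight of the term in that concept}."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, count in counts.items():
            idf = math.log(n / df[term]) + 1.0  # +1 keeps shared terms non-zero
            index.setdefault(term, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vec = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


INDEX = build_inverted_index(CONCEPTS)

query = "vehicle driver"
document = "car engine"

# The query and document share no keywords, yet both map to the same concept.
shared_keywords = set(tokenize(query)) & set(tokenize(document))
similarity = cosine(concept_vector(query, INDEX), concept_vector(document, INDEX))
```

In this toy setting, `shared_keywords` is empty, so a pure keyword match would miss the document, while the concept vectors of the query and the document both point at the `Automobile` concept. An ambiguous term such as "bank" maps to both `Finance` and `River`, which hints at why the abstract stresses feature selection to keep retrieval focused.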
