Study of document retrieval using Latent Semantic Indexing (LSI) on a very large data set.


Abstract

The primary purpose of an information retrieval system is to retrieve all documents relevant to the user's query. The Latent Semantic Indexing (LSI) based ad hoc document retrieval task investigates the performance of retrieval systems that search a static set of documents using new questions or queries. The performance of LSI has been tested on several smaller datasets (e.g., MED and CISI abstracts); however, LSI has not been tested on a large dataset. In this research, we concentrated on the performance of LSI on a large dataset. The stop word list and the term weighting scheme are two key parameters in information retrieval. We investigated the performance of LSI using three different stop word lists and also without removing stop words from the test collection. We also applied three different term weighting schemes (raw term frequency, log-entropy, and tf-idf) to measure the retrieval performance of LSI. We observed, first, that for an LSI-based ad hoc information retrieval system, a tailored stop word list must be assembled for every unique large dataset. Second, the tf-idf term weighting scheme shows better retrieval performance than the log-entropy and raw term frequency weighting schemes, even when the test collection becomes large.
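The pipeline named in the abstract (stop word removal, term weighting, then LSI) can be illustrated with a minimal sketch. This is not the thesis code: the four-document collection, the stop word set, the function names, the choice of `log(N/df)` as the idf variant, the particular log-entropy formula, and the unweighted query vector are all assumptions made for illustration, and the rank `k = 2` is arbitrary.

```python
import numpy as np

def term_document_matrix(docs, stop_words=frozenset()):
    """Raw term-frequency matrix (terms x documents) after stop word removal."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for w in doc:
            tf[index[w], j] += 1
    return tf, index

def tfidf(tf):
    """tf-idf with idf = log(N / df); one common variant, assumed here."""
    n_docs = tf.shape[1]
    df = np.count_nonzero(tf, axis=1)
    return tf * np.log(n_docs / df)[:, None]

def log_entropy(tf):
    """log(1 + tf) local weight times (1 + sum(p log p) / log N) global weight.
    Assumes at least two documents, so that log N is nonzero."""
    n_docs = tf.shape[1]
    p = tf / tf.sum(axis=1, keepdims=True)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log1p(tf) * g[:, None]

def lsi_fit(weighted, k):
    """Rank-k truncated SVD; rows of the third result are document vectors."""
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k].T

def lsi_scores(query_vec, U_k, s_k, doc_k):
    """Fold the query in as q^T U_k S_k^{-1}, then take cosine similarity."""
    q_hat = query_vec @ U_k / s_k
    denom = np.linalg.norm(doc_k, axis=1) * np.linalg.norm(q_hat)
    return doc_k @ q_hat / np.where(denom > 0, denom, 1.0)

# Hypothetical toy collection; the thesis used a much larger test collection.
docs = [
    "latent semantic indexing retrieval",
    "semantic indexing of documents",
    "stop word list removal",
    "tailored stop word list",
]
tf, index = term_document_matrix(docs, stop_words=frozenset({"of"}))
U_k, s_k, doc_k = lsi_fit(tfidf(tf), k=2)

# Query as raw term counts over the vocabulary (a simplification).
query = np.zeros(tf.shape[0])
for w in "latent retrieval".split():
    query[index[w]] += 1
scores = lsi_scores(query, U_k, s_k, doc_k)
print(np.argsort(-scores))  # document indices, best match first
```

Swapping `tfidf(tf)` for `log_entropy(tf)` or for `tf` itself gives the other two weighting schemes the abstract compares. In this toy collection the second document shares no term with the query, yet it should score close to the first because both load on the same latent dimension.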
机译:信息检索系统的主要目的是检索与用户查询相关的所有相关文档。基于潜在语义索引(LSI)的临时文档检索任务研究了使用新问题/查询搜索一组静态文档的检索系统的性能。 LSI的性能已针对多个较小的数据集(例如MED,CISI摘要等)进行了测试,但尚未针对大型数据集进行过LSI测试。在这项研究中,我们集中于大型数据集上LSI的性能。停用词列表和术语加权方案是信息检索领域中的两个关键参数。我们通过使用三种不同的停用词列表来研究LSI的性能,并且也没有从测试集中删除停用词。我们还应用了三种不同的术语加权(原始术语频率,对数熵和tf-idf)方案来衡量LSI的检索性能。我们观察到,首先,对于基于LSI的临时信息检索系统,必须为每个唯一的大型数据集组合定制的停用词列表。其次,即使当测试集变大时,使用tf-idf项加权方案也比对数熵和原始项频率加权方案显示出更好的检索性能。

Bibliographic record

  • Author: Zaman, A. N. K.
  • Author affiliation: University of Northern British Columbia (Canada)
  • Degree-granting institution: University of Northern British Columbia (Canada)
  • Subject: Computer Science
  • Degree: M.Sc.
  • Year: 2010
  • Pages: 103 p.
  • Total pages: 103
  • Original format: PDF
  • Language of text: English
