
A term co-occurrence based framework for understanding LSI: Theory and practice.


Abstract

Automatic methods for searching textual collections have been developed since the early 1960s, but a global solution to the problem remains elusive. Latent Semantic Indexing (LSI) is a well-known information retrieval algorithm. LSI is based on a linear algebraic technique, Singular Value Decomposition (SVD).

The primary goal of this dissertation is the development of a theoretical framework for understanding LSI. In particular, we study the values produced by the SVD process and determine their impact on LSI performance. We use two approaches to this analysis of values, and develop two practical applications based on our improved knowledge of the relationship between the values in the truncated matrices and the performance of LSI.

The focus in the first part of this dissertation is the development of a theoretical framework for understanding LSI. Our framework is based on the concept of term co-occurrences, and we prove that LSI encapsulates term co-occurrence information. We also show a strong correlation between the retrieval quality of LSI and the distribution of the term co-occurrence weights.

In the second part of this document, we extend our study of the values produced by SVD by implementing several practical applications. First, we identify the most critical values of the LSI matrices by reducing the density of the matrices by up to 70% without impacting retrieval quality. This reduction yields a 55% decrease in memory requirements at query run time. We also develop a term clustering algorithm based on the LSI term matrix. This algorithm is shown to produce effective clusters for use in an emerging trend detection application. Our emerging trend detection system achieved an F-measure (beta = 1) of 0.81 to 0.89 on several collections.
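As a rough illustration of the ideas in the abstract, the following minimal sketch builds rank-k LSI factors from a toy term-by-document matrix, forms the term-to-term similarity matrix that the truncated factors imply, and zeroes out small entries of the term matrix. The function names `lsi_truncate` and `sparsify`, the toy matrix, and the magnitude-threshold rule are assumptions made for illustration; the abstract does not specify how the dissertation selects which values to remove.

```python
import numpy as np


def lsi_truncate(A, k):
    """Return rank-k SVD factors (T_k, s_k, D_k) of a term-by-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def sparsify(M, fraction):
    """Zero out roughly `fraction` of the entries of M with the smallest magnitude.

    Hypothetical thresholding rule, not necessarily the dissertation's method.
    Ties at the threshold may remove slightly more than the requested fraction.
    """
    M = M.copy()
    n_drop = int(fraction * M.size)
    if n_drop > 0:
        flat = np.abs(M).ravel()
        threshold = np.partition(flat, n_drop - 1)[n_drop - 1]
        M[np.abs(M) <= threshold] = 0.0
    return M


# Toy collection: 5 terms (rows) by 4 documents (columns), raw counts.
A = np.array(
    [
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1],
    ],
    dtype=float,
)

T_k, s_k, D_k = lsi_truncate(A, k=2)

# Term-to-term similarities implied by the truncated factors: T_k S_k^2 T_k^T.
# With all singular values retained this equals A A^T, the matrix of raw
# term co-occurrence counts; truncation replaces it with a rank-k approximation.
term_term = (T_k * s_k**2) @ T_k.T

# Reduce the density of the term matrix by about 70%.
T_sparse = sparsify(T_k, 0.7)
```

Comparing `term_term` with `A @ A.T` for different values of `k` is one simple way to see how truncation redistributes co-occurrence weight among term pairs.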

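Bibliographic record

  • Author

    Kontostathis, April.

  • Author affiliation

    Lehigh University.

  • Degree-granting institution: Lehigh University.
  • Discipline: Computer Science.
  • Degree: Ph.D.
  • Year: 2003
  • Pages: 110 p.
  • Total pages: 110
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology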