Techniques for improved LSI text retrieval.



Abstract

This work identifies and studies four major issues in LSI (Latent Semantic Indexing) text retrieval: a multiplicity of standard query methods, alternative non-standard query methods, the issue of Generic Terms, and the lack of Structural Data.

Firstly, three commonly used standard query methods (versions A, B and B') are identified, compared, analyzed, and tested. Both mathematical analysis and experimental results reveal that version B is a better choice than version A, and that versions B and B' are essentially equivalent provided that the Equivalency Principle is satisfied. This finding should eliminate the confusion and arbitrariness that arise when LSI researchers apply possibly incompatible query methods, and should help restore the comparability of their work.

Secondly, some novel non-standard query methods based on the technique of singular value rescaling (SVR) are proposed and studied. Test results, both in prototype experimental environments and on the standardized TREC data sets, confirm the effectiveness of SVR. The practical significance of this finding is that current information retrieval techniques may be significantly improved simply by adopting a novel query method that is computationally as efficient as the best standard query method.

Thirdly, this work studies the effects on LSI models of Generic Terms, a minority group of terms that are distributed relatively uniformly across all document topics. Generic Terms are characterized and defined, and an iterative algorithm is designed and implemented to identify them. Experimental results strongly suggest that identifying and excluding Generic Terms helps improve LSI text retrieval performance.

Fourthly, this work studies how to integrate Structural Data (loosely defined as sentence structure) into LSI models. Four major characteristics of Structural Data are identified: derivativity, maneuverability, language dependency, and updatability/downdatability. The qualifications of two candidate forms of Structural Data, namely word order and non-word-order syntax (both for the English language), are carefully studied. A complete series of procedures is developed to fully integrate Structural Data (in its most qualified form, word order data) into LSI models. Experimental results strongly suggest that acquiring and integrating Structural Data helps improve LSI text retrieval performance.
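
For readers unfamiliar with LSI querying, the following is a minimal illustrative sketch, not the dissertation's method. It builds a rank-k LSI model with a truncated SVD and scores a query by cosine similarity in the latent space. The abstract does not define versions A, B and B' or the exact form of SVR, so no correspondence to them is implied; the function names, the toy matrix, and the `alpha` exponent (a hypothetical way of rescaling singular values, where `alpha=1.0` means no rescaling) are all assumptions made for illustration.

```python
import numpy as np

def build_lsi(term_doc, k):
    """Rank-k truncated SVD of a term-by-document matrix: A ~ U_k S_k V_k^T."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T  # V_k has one row per document

def score_query(query, U_k, s_k, V_k, alpha=1.0):
    """Cosine similarity between a projected query and each document.

    alpha is a hypothetical rescaling exponent applied to the singular
    values (an assumption, not taken from the source); alpha=1.0 compares
    U_k^T q against the columns of S_k V_k^T.
    """
    q_latent = U_k.T @ query              # query coordinates on the latent axes
    docs_latent = V_k * (s_k ** alpha)    # each document row scaled axis by axis
    denom = np.linalg.norm(docs_latent, axis=1) * np.linalg.norm(q_latent)
    denom = np.where(denom == 0, 1.0, denom)  # avoid division by zero
    return (docs_latent @ q_latent) / denom

# Toy corpus (made up): rows are terms, columns are documents d0..d3.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

U_k, s_k, V_k = build_lsi(A, k=2)

query = np.zeros(len(TERMS))
query[TERMS.index("car")] = 1.0

scores = score_query(query, U_k, s_k, V_k)               # no rescaling
rescaled = score_query(query, U_k, s_k, V_k, alpha=0.5)  # hypothetical rescaling
print(np.round(scores, 3))
```

In this toy example the query "car" scores d1 as highly as d0, although d1 contains "automobile" and not "car", because both documents share "engine" and fall on the same latent axis; the two flower documents score near zero. The example is too small for the `alpha` setting to change the ranking, so it shows only where such a rescaling would enter the computation.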

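Bibliographic record

  • Author

    Yan, Hua

  • Author affiliation

    Wayne State University

  • Degree-granting institution: Wayne State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: p.1535
  • Total pages: 220
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology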
