Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Analysis of Lexical Signatures for Finding Lost or Related Documents


Abstract

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based methods and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
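As a rough illustration of the three basic selection criteria named in the abstract (TF, DF, and TFIDF), the following minimal sketch picks signature terms from a document given a small reference corpus. Everything concrete here is an assumption for illustration and is not taken from the paper: the function names `tokenize` and `lexical_signature`, the whitespace tokenizer, the signature length `k`, the unsmoothed `log(N / df)` weighting, and the alphabetical tie-breaking. The PW method and the hybrid methods are not sketched, because the abstract does not say how they rank or combine terms.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic words.
    return [w for w in text.lower().split() if w.isalpha()]


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Return up to k terms of `doc`, ranked by the chosen criterion.

    method="tf":    highest term frequency in the document first
    method="df":    lowest document frequency in the corpus first
    method="tfidf": highest tf * log(N / df) first
    The value of k and the exact weighting are assumptions, not the paper's.
    """
    tf = Counter(tokenize(doc))

    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n_docs = len(corpus)

    def score(term):
        d = max(df[term], 1)  # guard for terms missing from the corpus
        if method == "tf":
            return tf[term]
        if method == "df":
            return -d  # rarer terms get a higher score
        if method == "tfidf":
            return tf[term] * math.log(n_docs / d)
        raise ValueError(f"unknown method: {method}")

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

On a toy corpus, the TF criterion tends to pick common words such as "the", while DF and TFIDF favor rare terms. This mirrors the abstract's observation that DF-minimizing signatures identify a document uniquely but are less useful for finding related documents.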
