【24h】

Compressed Web Indexes

机译:压缩网页索引

获取原文

摘要

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web.We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
机译:Web搜索引擎使用索引来有效地检索包含指定查询词的页面以及链接到指定页面的页面。允许快速检索的压缩索引问题历史悠久。我们考虑这个问题:假设页面中的术语(或指向页面的链接)是从概率分布中生成的,那么我们可以在多大程度上构建出允许快速检索的索引呢?当概率分布为Zipfian(或类似的幂定律)时,会特别引起关注,因为这些是网络上出现的分布。 对于遵循齐普夫定律的文本文档,我们对布尔索引的空间要求有了明确的界限。在此过程中,我们开发了一种适用于任何概率分布的通用技术,不一定适用于幂定律。这是对任意分布下的索引压缩的首次分析。我们的界限导致了索引的民俗学的定量版本的经验法则。我们对多个文档集的实验表明,术语的分布似乎遵循双重帕累托定律而不是齐普夫定律。尽管有各式各样的文档集,但实验中观察到的索引大小仍符合我们的理论预测。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号