首页> 外文OA文献 >The Quantile Index - Succinct Self-Index for Top-k Document Retrieval
【2h】

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

机译:分位数指数 - Top-k文献检索的简洁自我索引

代理获取
本网站仅为用户提供外文OA文献查询和代理获取服务,本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文,但由于OA文献来源多样且变更频繁,仍可能出现获取不到、文献不完整或与标题不符等情况,如果获取不到我们将提供退款服务。请知悉。

摘要

One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections.In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.
机译:信息检索中的主要问题之一是在大型文本集中找到与用户给出的查询最匹配的k个文档。 Navarro&Nekrich(SODA 2012)的最新结果表明,长度为m的单项和短语查询可以使用线性单词大小索引在最佳O(m + k)时间内求解。尽管该索引的逐字实现至少比原始集合大一个数量级,但是各种作者逐渐改进了索引,以至于当前空间需求为标准集合的文本大小的1.5到2.0倍在本文中,我们针对不同的前k个索引提出了新的时间/空间权衡。这是通过仅对原始反向文件或基于后缀数组的索引中的发布进行采样而实现的。对于那些无法使用样本采样版本回答的查询,我们展示了如何高效地动态计算查询结果。例如,我们将方法应用于Navarro&Nekrich的top-k框架。在适用于大多数标准文本的概率假设以及称为术语频率的标准评分功能下,我们的索引只能用亚线性表示,多位加上文本的压缩后缀数组所需的空间,同时保持对数查询时间。我们评估我们在实际数据集上的解决方案,并将其实际空间使用情况和性能与最新的实现方式进行比较。我们的实验表明,我们的索引压缩到原始文本的大小以下。据我们所知,它是第一个基于后缀数组的文本索引,即使对于非重复性集合,它也可以在实践中突破这一限制,同时对于前十个查询,平均查询时间平均仍保持在半毫秒以下。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号