首页> 外文期刊>LIPIcs : Leibniz International Proceedings in Informatics >The Quantile Index - Succinct Self-Index for Top-k Document Retrieval
【24h】

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

机译:分位数索引-用于Top-k文档检索的简洁自索引

获取原文
获取外文期刊封面目录资料

摘要

One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections. In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.
机译:信息检索中的主要问题之一是在大型文本集中找到与用户给出的查询最匹配的k个文档。 Navarro&Nekrich(SODA 2012)的最新结果表明,长度为m的单词和短语查询可以使用线性单词大小索引在最佳O(m + k)时间内求解。尽管该索引的逐字实现至少比原始集合大一个数量级,但各种作者还是逐步改进了索引,以至于当前空间需求在标准集合的文本大小的1.5到2.0范围内。在本文中,我们针对不同的前k个索引提出了新的时间/空间权衡。这是通过仅对原始反向文件或基于后缀数组的索引中的发布进行采样而实现的。对于那些无法使用样本采样版本回答的查询,我们展示了如何高效地动态计算查询结果。例如,我们将我们的方法应用于Navarro&Nekrich的top-k框架。在大多数标准文本都适用的概率假设下,以及对于称为术语频率的标准评分功能,我们的索引只能用亚线性表示,多位加上文本压缩后缀数组所需的空间,同时保持多对数查询时间。我们评估我们在真实数据集上的解决方案,并将其实际空间使用情况和性能与最新的实现方式进行比较。我们的实验表明,我们的索引压缩到原始文本的大小以下。据我们所知,它是第一个基于后缀数组的文本索引,即使对于非重复性集合,它也可以在实践中突破这一限制,同时对于前十个查询,平均查询时间平均仍保持在半毫秒以下。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号