The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

Niklas Baumstark; Simon Gog; Tobias Heuer; Julian Labeit

首页> 外文期刊>LIPIcs : Leibniz International Proceedings in Informatics >The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

【24h】

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

机译：分位数索引-用于Top-k文档检索的简洁自索引

获取原文

获取外文期刊封面目录资料

开具论文收录证明 >>

文献代查 >>

文献数据库（团队版） >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections. In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.

机译：信息检索中的主要问题之一是在大型文本集中找到与用户给出的查询最匹配的k个文档。 Navarro＆Nekrich（SODA 2012）的最新结果表明，长度为m的单词和短语查询可以使用线性单词大小索引在最佳O（m + k）时间内求解。尽管该索引的逐字实现至少比原始集合大一个数量级，但各种作者还是逐步改进了索引，以至于当前空间需求在标准集合的文本大小的1.5到2.0范围内。在本文中，我们针对不同的前k个索引提出了新的时间/空间权衡。这是通过仅对原始反向文件或基于后缀数组的索引中的发布进行采样而实现的。对于那些无法使用样本采样版本回答的查询，我们展示了如何高效地动态计算查询结果。例如，我们将我们的方法应用于Navarro＆Nekrich的top-k框架。在大多数标准文本都适用的概率假设下，以及对于称为术语频率的标准评分功能，我们的索引只能用亚线性表示，多位加上文本压缩后缀数组所需的空间，同时保持多对数查询时间。我们评估我们在真实数据集上的解决方案，并将其实际空间使用情况和性能与最新的实现方式进行比较。我们的实验表明，我们的索引压缩到原始文本的大小以下。据我们所知，它是第一个基于后缀数组的文本索引，即使对于非重复性集合，它也可以在实践中突破这一限制，同时对于前十个查询，平均查询时间平均仍保持在半毫秒以下。

著录项

来源
《LIPIcs : Leibniz International Proceedings in Informatics》 |2017年第4期|共14页
作者
Niklas Baumstark; Simon Gog; Tobias Heuer; Julian Labeit;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种
中图分类计算技术、计算机技术;
关键词

相似文献

外文文献
中文文献
专利

1. Efficient Top-k Document Retrieval for Long Queries Using Term-Document Binary Matrix — Pursuit of Enhanced Informational Search on the Web — [J] . Etsuro FUJITA, Keizo OYAMA IEICE transactions on information and systems . 2013,第5期

机译：使用术语文档二进制矩阵对长查询进行有效的Top-k文档检索-追求增强的Web信息搜索能力-
2. Efficient Top-k Document Retrieval for Long Queries Using Term-Document Binary Matrix: Pursuit of Enhanced Informational Search on the Web [J] . Etsuro Fujita, Keizo Oyama IEICE Transactions on Information and Systems . 2013,第5期

机译：使用术语文档二进制矩阵对长查询进行有效的Top-k文档检索：追求增强的Web信息搜索
3. Practical Compact Indexes for Top-k Document Retrieval [J] . SIMON GOG, ROBERTO KONOW, GONZALO NAVARRO Journal of experimental algorithmics . 2017,第1期

机译：适用于Top-k文档检索的实用紧凑索引
4. Approximating Document Frequency for Self-Index based Top-k Document Retrieval [C] . Suzuki Tokinori, Fujii Atsushi IEEE International Conference on Advanced Information Networking and Applications . 2015

机译：基于自索引的Top-k文档检索的近似文档频率
5. Top-k with diversity-m data retrieval in Wireless Sensor Networks [D] . Puram, Kiran Kumar 2014

机译：无线传感器网络中具有多样性-m数据检索的Top-k
6. Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs [O] . Xiaofeng Yang, Patrick K. Nicholson, Deepak Ajwani, -1

机译：Any-k：随时在标签图中检索前k个树型
7. The Quantile Index - Succinct Self-Index for Top-k Document Retrieval [O] . Baumstark Niklas, Gog Simon, Heuer Tobias, 2017

机译：分位数指数 - Top-k文献检索的简洁自我索引

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅