首页> 外文OA文献 >The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

【2h】

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

机译：分位数指数 - Top-k文献检索的简洁自我索引

代理获取

本网站仅为用户提供外文OA文献查询和代理获取服务，本网站没有原文。下单后我们将采用程序或人工为您竭诚获取高质量的原文，但由于OA文献来源多样且变更频繁，仍可能出现获取不到、文献不完整或与标题不符等情况，如果获取不到我们将提供退款服务。请知悉。

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

One of the central problems in information retrieval is that of finding the k documents in a large text collection that best match a query given by a user. A recent result of Navarro & Nekrich (SODA 2012) showed that single term and phrase queries of length m can be solved in optimal O(m+k) time using a linear word sized index. While a verbatim implementation of the index would be at least an order of magnitude larger than the original collection, various authors incrementally improved the index to a point where the space requirement is currently within a factor of 1.5 to 2.0 of the text size for standard collections.In this paper, we propose a new time/space trade-off for different top-k indexes. This is achieved by sampling only a quantile of the postings in the original inverted file or suffix array-based index. For those queries that cannot be answered using the sampled version of the index we show how to compute the query results on the fly efficiently. As an example, we apply our method to the top-k framework by Navarro & Nekrich. Under probabilistic assumptions that hold for most standard texts, and for a standard scoring function called term frequency, our index can be represented with only sublinearly many bits plus the space needed for a compressed suffix array of the text, while maintaining poly-logarithmic query times. We evaluate our solution on real-world datasets and compare its practical space usage and performance against state-of-the-art implementations. Our experiments show that our index compresses below the size of the original text. To our knowledge it is the first suffix array-based text index that is able to break this bound in practice even for non-repetitive collections, while still maintaining reasonable query times of under half a millisecond on average for top-10 queries.

机译：信息检索中的主要问题之一是在大型文本集中找到与用户给出的查询最匹配的k个文档。 Navarro＆Nekrich（SODA 2012）的最新结果表明，长度为m的单项和短语查询可以使用线性单词大小索引在最佳O（m + k）时间内求解。尽管该索引的逐字实现至少比原始集合大一个数量级，但是各种作者逐渐改进了索引，以至于当前空间需求为标准集合的文本大小的1.5到2.0倍在本文中，我们针对不同的前k个索引提出了新的时间/空间权衡。这是通过仅对原始反向文件或基于后缀数组的索引中的发布进行采样而实现的。对于那些无法使用样本采样版本回答的查询，我们展示了如何高效地动态计算查询结果。例如，我们将方法应用于Navarro＆Nekrich的top-k框架。在适用于大多数标准文本的概率假设以及称为术语频率的标准评分功能下，我们的索引只能用亚线性表示，多位加上文本的压缩后缀数组所需的空间，同时保持对数查询时间。我们评估我们在实际数据集上的解决方案，并将其实际空间使用情况和性能与最新的实现方式进行比较。我们的实验表明，我们的索引压缩到原始文本的大小以下。据我们所知，它是第一个基于后缀数组的文本索引，即使对于非重复性集合，它也可以在实践中突破这一限制，同时对于前十个查询，平均查询时间平均仍保持在半毫秒以下。

著录项

作者
Baumstark Niklas; Gog Simon; Heuer Tobias; Labeit Julian;
展开▼
作者单位

展开▼
年度 2017
总页数
原文格式 PDF
正文语种 eng
中图分类

相似文献

外文文献
中文文献
专利

1. The Quantile Index - Succinct Self-Index for Top-k Document Retrieval [J] . Niklas Baumstark, Simon Gog, Tobias Heuer, LIPIcs : Leibniz International Proceedings in Informatics . 2017,第4期

机译：分位数索引-用于Top-k文档检索的简洁自索引
2. Efficient Top-k Document Retrieval for Long Queries Using Term-Document Binary Matrix — Pursuit of Enhanced Informational Search on the Web — [J] . Etsuro FUJITA, Keizo OYAMA IEICE transactions on information and systems . 2013,第5期

机译：使用术语文档二进制矩阵对长查询进行有效的Top-k文档检索-追求增强的Web信息搜索能力-
3. Efficient Top-k Document Retrieval for Long Queries Using Term-Document Binary Matrix: Pursuit of Enhanced Informational Search on the Web [J] . Etsuro Fujita, Keizo Oyama IEICE Transactions on Information and Systems . 2013,第5期

机译：使用术语文档二进制矩阵对长查询进行有效的Top-k文档检索：追求增强的Web信息搜索
4. Approximating Document Frequency for Self-Index based Top-k Document Retrieval [C] . Suzuki Tokinori, Fujii Atsushi IEEE International Conference on Advanced Information Networking and Applications . 2015

机译：基于自索引的Top-k文档检索的近似文档频率
5. Top-k with diversity-m data retrieval in Wireless Sensor Networks [D] . Puram, Kiran Kumar 2014

机译：无线传感器网络中具有多样性-m数据检索的Top-k
6. Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs [O] . Xiaofeng Yang, Patrick K. Nicholson, Deepak Ajwani, -1

机译：Any-k：随时在标签图中检索前k个树型
7. Elias-Fano meets Single-Term Top-k Document Retrieval [O] . Julian Labeit, Simon Gog 2017

机译：elias-fano符合单词Top-K文件检索

The Quantile Index - Succinct Self-Index for Top-k Document Retrieval

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅