Document retrieval on repetitive string collections

Abstract

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
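
The three queries named in the abstract can be illustrated with a small, uncompressed baseline. The sketch below is not the paper's method: it uses neither interleaved LCPs nor precomputed document lists, and makes no attempt at compression. It assumes a plain suffix array over the concatenated collection plus a document array that records which document each suffix starts in. All function names, the `"\x00"` separator, and the sample documents are hypothetical choices made for illustration.

```python
from collections import Counter


def build_index(docs):
    """Build a suffix array and document array over the concatenated documents.

    Assumption: the separator "\x00" occurs in no document or pattern.
    The construction is naive and only suitable for toy inputs.
    """
    text = ""
    doc_of_pos = []
    for doc_id, doc in enumerate(docs):
        text += doc + "\x00"
        doc_of_pos.extend([doc_id] * (len(doc) + 1))
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    doc_array = [doc_of_pos[i] for i in suffix_array]
    return text, suffix_array, doc_array


def pattern_range(text, suffix_array, pattern):
    """Return the half-open suffix array range [lo, hi) of suffixes
    that start with the pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def document_listing(index, pattern):
    """All documents in which the pattern appears, as sorted document ids."""
    text, suffix_array, doc_array = index
    lo, hi = pattern_range(text, suffix_array, pattern)
    return sorted(set(doc_array[lo:hi]))


def document_counting(index, pattern):
    """Number of distinct documents in which the pattern appears."""
    return len(document_listing(index, pattern))


def top_k(index, pattern, k):
    """The k documents where the pattern appears most often, as
    (document id, frequency) pairs; ties are broken by document id."""
    text, suffix_array, doc_array = index
    lo, hi = pattern_range(text, suffix_array, pattern)
    freq = Counter(doc_array[lo:hi])
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical sample collection of three similar strings.
docs = ["GATTACA", "GATTAGA", "TACATACA"]
index = build_index(docs)
print(document_listing(index, "ACA"))   # [0, 2]
print(document_counting(index, "TA"))   # 3
print(top_k(index, "TA", 2))            # [(2, 2), (0, 1)]
```

This baseline scans every occurrence in the suffix array range, so its query time grows with the number of occurrences rather than the number of documents reported, and its space is proportional to the full text. Those are the costs that compressed indexes for repetitive collections aim to reduce.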

Bibliographic details

Journal: Springer Open Choice
Authors: Travis Gagie; Aleksi Hartikainen; Kalle Karhu; Juha Kärkkäinen; Gonzalo Navarro; Simon J. Puglisi; Jouni Sirén
Volume (issue): 20(3)
Pages: 253–291
Total pages: 39
Original format: PDF
Chinese Library Classification: Surgery
Keywords: Repetitive string collections; Document retrieval on strings; Suffix trees and arrays

Similar documents

1. Document retrieval on repetitive string collections [J]. Gagie Travis, Hartikainen Aleksi, Karhu Kalle. Information Retrieval, 2017, no. 3.
2. On the reproducibility of experiments of indexing repetitive document collections [J]. Farina Antonio, Martinez-Prieto Miguel A., Claude Francisco. Information Systems, 2019, no. JULa.
3. Document Retrieval on Repetitive Collections [C]. Gonzalo Navarro, Simon J. Puglisi, Jouni Sirén. Annual European Symposium on Algorithms, 2014.
4. Combinatoric models of information retrieval ranking methods and performance measures for weakly-ordered document collections [D]. Church, Lewis. 2010.
5. Repetitive Transcranial Magnetic Stimulation Improved Source Memory and Modulated Recollection-Based Retrieval in Healthy Older Adults [O]. Xiaoyu Cui, Weicong Ren, Zhiwei Zheng. 2020.
6. Document retrieval on repetitive string collections [O]. Travis Gagie, Aleksi Hartikainen, Kalle Karhu. 2017.
7. Retrieval systems for non-static document collections [R]. Hillman, Donald J. 1965.
