
Document retrieval on repetitive string collections


Abstract

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
