首页> 外文会议>International Symposium on String Processing and Information Retrieval >Fast, Small, and Simple Document Listing on Repetitive Text Collections
【24h】

Fast, Small, and Simple Document Listing on Repetitive Text Collections

机译:快速,小巧和简单的重复性文本集合文档列表

获取原文

摘要

Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. Instead, there are not many analogous indexes for document, retrieval. In this paper we present a simple document listing index for repetitive string collections of total length n that lists the ndoc distinct documents where a pattern of length m appears in time O(m + ndoc • lg n). We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
机译:字符串集合上的文档列表是查找出现模式的所有文档的任务。它被视为最基本的文档检索问题,在各种应用程序中很有用。许多增长最快的字符串集合是由非常相似的文档组成的,例如版本代码和文档集合,基因组存储库等。专为重复文本集合设计的纯模式匹配索引可减少空间数量级。取而代之的是,文档,检索没有很多类似的索引。在本文中,我们为总长度为n的重复字符串集合提供了一个简单的文档列表索引,该索引列出了ndoc个不同的文档,其中长度为m的模式出现在时间O(m + ndoc•lg n)中。我们利用文档数组的重复性(即将后缀数组粗化为文档标识符)进行语法压缩,同时预先计算非终结符的答案,并以语法压缩的形式存储它们。我们的实验结果表明,我们的索引大大优于时空折衷图中的现有替代方案。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号