Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo

Gonzalo Navarro

首页> 外文期刊>Information and computation >Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo

【24h】

Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo

机译：Lempel-Ziv压缩文本索引的空间高效构造Diego Arroyuelo

获取原文

获取原文并翻译 | 示例

获取外文期刊封面封底 >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87-114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[l..u] over an alphabet of size a, the LZ-index requires 4|LZ|(1 +o(l)) bits of space, where |LZ| is the size of the LZ78-compression of T. This can be bounded by LZ = uH——k(T) + o(u logσ), where Hk(T) is the k-th order empirical entropy of T, for any k = o(log_σ u). The LZ-index is built in O(u log a) time, yet requiring O(ulogu) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0-1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(logcr + loglogu)) time and requiring 4|iZ|(l +o(l)) bits of main memory, that is, asymptotically the same space of the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times |LZ|(1 +o(l)) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O(u(logcr + loglogu)) time, and within |LZ|(1 +o(l)) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.

机译：压缩的全文本自我索引是一种数据结构，它替换文本，并为其提供索引访问，同时占用与压缩的文本大小成比例的空间。如今，这非常重要，因为可以将非常大的文本的索引完全容纳在主存储器中，从而避免了对二级存储的访问变慢。特别是LZ指数[G. Navarro，使用Ziv-Lempel Trie索引文本，离散算法杂志（JDA）2（1）（2004）87-114]在提取文本段落和定位模式出现方面表现出色。给定大小为a的字母上的文本T [l..u]，LZ索引需要4 | LZ |（1 + o（l））位空间，其中| LZ |是T的LZ78压缩的大小。可以用LZ = uH——k（T）+ o（ulogσ）来界定，其中Hk（T）是T的k阶经验熵，对于任何k = o（log_σu）。 LZ索引的建立时间为O（u log a），但在最坏的情况下需要主存储器的O（ulogu）位。实际上，LZ索引占用文本大小的1.0-1.5倍（并替换文本），但其构造需要大约5倍的文本大小。这限制了它对中型文本的适用性。在本文中，我们提出一种节省空间的算法，以在O（u（logcr + loglogu））的时间内构造LZ索引，并需要主存储器的4 | iZ |（l + o（l））位，即渐近地最终索引的空间相同。我们还调整了算法以构造LZ索引的最新简化版本，该版本占用了| LZ |（1 + o（l））位的1至3倍，并显示了它们也可以使用渐近相同的空间来构建最终索引的最后，我们研究一种替代模型，在该模型中，仅给我们有限的主内存来执行索引过程（少于最终索引所要求的数量），而其余部分必须使用磁盘。我们展示了如何在O（u（logcr + loglogu））的时间内，以及在主存储器的| LZ |（1 + o（l））位内构建所有LZ-index变体，即渐近地仅容纳空间LZ78压缩的文本。我们的实验结果表明，我们的方法在实践中是有效的，需要接近最终索引的内存量，并且与其他压缩索引的最佳构造时间相比具有竞争力。

著录项

来源
《Information and computation》 |2011年第7期|p.1070-1102|共33页
作者
Gonzalo Navarro;
展开▼
作者单位

Yahool Research Latin America, Blanco Encalada 2120, Santiago, Chile;

Dept. of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;

展开▼
收录信息美国《科学引文索引》(SCI);美国《工程索引》(EI);
原文格式 PDF
正文语种 eng
中图分类
关键词
full-text indexing; compressed data structures; text databases; lempel-ziv compression;

机译：全文索引;压缩数据结构;文本数据库;lempel-ziv压缩;

相似文献

外文文献
中文文献
专利

1. Stronger Lempel-Ziv Based Compressed Text Indexing [J] . Diego Arroyuelo, Gonzalo Navarro, Kunihiko Sadakane Algorithmica . 2012,第1a2期

机译：基于Lempel-Ziv的更强大的压缩文本索引
2. Distribution-Aware Compressed Full-Text Indexes [J] . Paolo Ferragina, Jouni Siren, Rossano Venturini Algorithmica . 2013,第4期

机译：分发感知的压缩全文本索引
3. Compressed indexes for text with wildcards [J] . Chris Thachuk Theoretical computer science . 2013,第Null期

机译：带通配符的文本的压缩索引
4. Space-efficient genome comparisons with compressed full-text indexes [C] . Enno Ohlebusch, Simon Gog 2nd international conference on bioinformatics and computational biology 2010 . 2010

机译：利用压缩的全文本索引进行节省空间的基因组比较
5. Data analytics and visualization for enhanced highway construction cost indexes and as-built schedules [D] . Shrestha, Krishna Prasad. 2016

机译：增强公路施工成本指标和竣工时间表的数据分析与可视化
6. Construction of Cognitive Maps to Improve Reading Performance by Text Signaling: Reading Text on Paper Compared to on Screen [O] . Zifu Shi, Ting Tang, Lin Yin 2020

机译：认知地图的构建以提高文本信令的阅读性能：与屏幕相比读取文本
7. Space-efficient construction of Lempel-Ziv compressed text indexes [O] . Diego Arroyuelo, Gonzalo Navarro 2009

机译：Lempel-Ziv压缩文本索引的节省空间的构造
8. Comprehension and Analysis of Information in Text: I. Construction and Evaluation of Brief Texts [R] . Kozminsky, E., Bourne, L. E., Kintsch, W. 1979

机译：文本信息的理解与分析：一，简要文本的构建与评价

Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo

摘要

著录项

相似文献

相关主题

期刊订阅