...
首页> 外文期刊>Information and computation >Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo
【24h】

Space-efficient construction of Lempel-Ziv compressed text indexes Diego Arroyuelo

机译:Lempel-Ziv压缩文本索引的空间高效构造Diego Arroyuelo

获取原文
获取原文并翻译 | 示例
   

获取外文期刊封面封底 >>

       

摘要

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. This is very important nowadays, since one can accommodate the index of very large texts entirely in main memory, avoiding the slower access to secondary storage. In particular, the LZ-index [G. Navarro, Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms (JDA) 2 (1) (2004) 87-114] stands out for its good performance at extracting text passages and locating pattern occurrences. Given a text T[l..u] over an alphabet of size a, the LZ-index requires 4|LZ|(1 +o(l)) bits of space, where |LZ| is the size of the LZ78-compression of T. This can be bounded by LZ = uH——k(T) + o(u logσ), where Hk(T) is the k-th order empirical entropy of T, for any k = o(log_σ u). The LZ-index is built in O(u log a) time, yet requiring O(ulogu) bits of main memory in the worst case. In practice, the LZ-index occupies 1.0-1.5 times the text size (and replaces the text), but its construction requires around 5 times the text size. This limits its applicability to medium-sized texts. In this paper we present a space-efficient algorithm to construct the LZ-index in O(u(logcr + loglogu)) time and requiring 4|iZ|(l +o(l)) bits of main memory, that is, asymptotically the same space of the final index. We also adapt our algorithm to construct more recent reduced versions of the LZ-index, which occupy from 1 to 3 times |LZ|(1 +o(l)) bits, and show that these can also be built using asymptotically the same space of the final index. Finally, we study an alternative model in which we are given only a limited amount of main memory to carry out the indexing process (less than that required by the final index), and must use the disk for the rest. We show how to build all the LZ-index variants in O(u(logcr + loglogu)) time, and within |LZ|(1 +o(l)) bits of main memory, that is, asymptotically just the space to hold the LZ78-compressed text. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index, and being competitive with the best construction times of other compressed indexes.
机译:压缩的全文本自我索引是一种数据结构,它替换文本,并为其提供索引访问,同时占用与压缩的文本大小成比例的空间。如今,这非常重要,因为可以将非常大的文本的索引完全容纳在主存储器中,从而避免了对二级存储的访问变慢。特别是LZ指数[G. Navarro,使用Ziv-Lempel Trie索引文本,离散算法杂志(JDA)2(1)(2004)87-114]在提取文本段落和定位模式出现方面表现出色。给定大小为a的字母上的文本T [l..u],LZ索引需要4 | LZ |(1 + o(l))位空间,其中| LZ |是T的LZ78压缩的大小。可以用LZ = uH——k(T)+ o(ulogσ)来界定,其中Hk(T)是T的k阶经验熵,对于任何k = o(log_σu)。 LZ索引的建立时间为O(u log a),但在最坏的情况下需要主存储器的O(ulogu)位。实际上,LZ索引占用文本大小的1.0-1.5倍(并替换文本),但其构造需要大约5倍的文本大小。这限制了它对中型文本的适用性。在本文中,我们提出一种节省空间的算法,以在O(u(logcr + loglogu))的时间内构造LZ索引,并需要主存储器的4 | iZ |(l + o(l))位,即渐近地最终索引的空间相同。我们还调整了算法以构造LZ索引的最新简化版本,该版本占用了| LZ |(1 + o(l))位的1至3倍,并显示了它们也可以使用渐近相同的空间来构建最终索引的最后,我们研究一种替代模型,在该模型中,仅给我们有限的主内存来执行索引过程(少于最终索引所要求的数量),而其余部分必须使用磁盘。我们展示了如何在O(u(logcr + loglogu))的时间内,以及在主存储器的| LZ |(1 + o(l))位内构建所有LZ-index变体,即渐近地仅容纳空间LZ78压缩的文本。我们的实验结果表明,我们的方法在实践中是有效的,需要接近最终索引的内存量,并且与其他压缩索引的最佳构造时间相比具有竞争力。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号