首页> 外文期刊>Algorithmica >LZ77 Computation Based on the Run-Length Encoded BWT
【24h】

LZ77 Computation Based on the Run-Length Encoded BWT

机译:基于游程编码的BWT的LZ77计算

获取原文
获取原文并翻译 | 示例

摘要

Computing the LZ77 factorization is a fundamental task in text compression and indexing, being the size z of this compressed representation closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that $$mathcal {O}(z)$$ O ( z ) words of space can be significantly (up to exponentially ) smaller than the size n of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number $$r$$ r of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. As z , the measure $$r$$ r is closely related to the number of repetitions in the text and can be exponentially smaller than n . We describe two algorithms computing LZ77 in $$mathcal {O}(rlog n)$$ O ( r log n ) bits of working space and $$mathcal {O}(nlog r)$$ O ( n log r ) time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.
机译:计算LZ77分解是文本压缩和索引编制中的一项基本任务,因为此压缩表示的大小z与文本的自我重复性密切相关。一个长期存在的问题是使用较小的工作空间来计算LZ77。考虑到$$ mathcal {O}(z)$$ O(z)个空间词可以比输入文本的大小n显着(最大为指数)小,即使简洁和熵压缩的解决方案也常常对内存有过分的要求。在这项工作中,我们着重研究文本重复性的一个重要指标:等号的数量$ r $$ r在反向输入文本的Burrows-Wheeler变换中运行。作为z的量度$$ r $$ r与文本中的重复次数密切相关,并且可以成倍地小于n。我们描述了两种在工作空间的$ {ma} cal {O}(rlog n)$$ O(r log n)位和$$ mathcal {O}(nlog r)$$ O(n log r)时间中计算LZ77的算法。粗略地说,我们的算法在每次BWT运行中存储恒定数量的存储字以跟踪首尾运行位置,并使用合适的索引机制对BWT的运行(而不是其位置)进行采样。我们的结果的重要结果包括:(i)无需先对文本进行解压缩就可以从RLBWT转换为基于LZ77的压缩格式的可能性,以及(ii)基于这些的渐近最优自我索引的渐近最优构造算法的存在压缩技术。最后,我们描述了我们解决方案的实现,并针对高度重复的数据集进行了广泛的实验。我们的算法使用的工作空间仅为数据集大小的1%,并且与分别基于熵压缩和后缀数组的现有解决方案相比,其空间效率比现有解决方案高2到3个数量级(尽管速度较慢)。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号