LZ77 Computation Based on the Run-Length Encoded BWT

Alberto Policriti; Nicola Prezza


Abstract

Computing the LZ77 factorization is a fundamental task in text compression and indexing, since the size $z$ of this compressed representation is closely related to the self-repetitiveness of the text. A long-standing problem is to compute LZ77 using small working space. Considering that $\mathcal{O}(z)$ words of space can be significantly (up to exponentially) smaller than the size $n$ of the input text, even succinct and entropy-compressed solutions are often unduly memory demanding. In this work we focus on an important measure of text repetitiveness: the number $r$ of equal-letter runs in the Burrows–Wheeler transform of the reversed input text. Like $z$, the measure $r$ is closely related to the number of repetitions in the text and can be exponentially smaller than $n$. We describe two algorithms computing LZ77 in $\mathcal{O}(r\log n)$ bits of working space and $\mathcal{O}(n\log r)$ time. Roughly speaking, our algorithms store a constant number of memory words per BWT run to keep track of first-last run-positions and a suitable indexing mechanism to sample the runs of the BWT (instead of its positions). Important consequences of our results include (i) the possibility to convert from RLBWT- to LZ77-based compressed formats without first decompressing the text, and (ii) the existence of asymptotically-optimal construction algorithms for repetition-aware self-indexes based on these compression techniques. We finally describe an implementation of our solutions and present extensive experiments on highly repetitive datasets. Our algorithms use a working space as small as 1% of the dataset size and are two to three orders of magnitude more space-efficient (albeit slower) than existing solutions based, respectively, on entropy compression and suffix arrays.
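To make the two repetitiveness measures from the abstract concrete, the following Python sketch computes $z$ and $r$ naively on a small string. It is only an illustration and not the authors' algorithm: it runs in polynomial time with linear working space, whereas the paper targets $\mathcal{O}(r\log n)$ bits. The function names, the `"\0"` terminator, and the choice of LZ77 variant (longest earlier-occurring prefix, overlaps allowed, followed by one explicit character) are assumptions made for this example.

```python
def lz77_phrase_count(text: str) -> int:
    """Count phrases of a greedy LZ77 parse (naive; assumed variant:
    longest previously occurring prefix plus one trailing character)."""
    i, z, n = 0, 0, len(text)
    while i < n:
        length = 0
        # Extend the copy while text[i:i+length+1] also occurs at some
        # start position j < i (the source may overlap the phrase itself).
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1  # copied part plus one explicit character
        z += 1
    return z


def bwt(text: str, terminator: str = "\0") -> str:
    """Burrows-Wheeler transform via sorted rotations (naive).
    Assumes `terminator` does not occur in `text` and sorts before all
    of its characters."""
    s = text + terminator
    rotations = sorted(s[k:] + s[:k] for k in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal equal-letter runs in s."""
    return sum(1 for k in range(len(s)) if k == 0 or s[k] != s[k - 1])


def rlbwt_runs_of_reverse(text: str) -> int:
    """The measure r: equal-letter runs in the BWT of the reversed text."""
    return count_runs(bwt(text[::-1]))


if __name__ == "__main__":
    sample = "ab" * 64  # highly repetitive input, n = 128
    print("n =", len(sample))
    print("z =", lz77_phrase_count(sample))
    print("r =", rlbwt_runs_of_reverse(sample))
```

On the repetitive input `"ab" * 64` (so $n = 128$), this sketch gives $z = 3$ and $r = 4$ (counting the terminator's run), which illustrates how both measures can stay far below $n$ as the text grows by repetition.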


Bibliographic record

Source
Algorithmica | 2018, Issue 7 | pp. 1986-2011 | 26 pages
Authors
Alberto Policriti; Nicola Prezza
Affiliations

Department of Informatics, Mathematics, and Physics, University of Udine,Applied Genomics Institute;

Department of Informatics, Mathematics, and Physics, University of Udine;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Chinese Library Classification
Keywords
Run-length encoded BWT; Lempel–Ziv factorization; Repetitive text collections; Repetition-aware data structures;


Similar documents

1. From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back [J]. Alberto Policriti, Nicola Prezza. LIPIcs: Leibniz International Proceedings in Informatics, 2017, Issue 30.
2. Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding [J]. Bastien Cazaux, Eric Rivals. LIPIcs: Leibniz International Proceedings in Informatics, 2019, Issue 29.
3. Fast computation of hologram patterns of a 3D object using run-length encoding and novel look-up table methods [J]. Seung-Cheol Kim, Eun-Soo Kim. Applied Optics, 2009, Issue 6.
4. Comparison of Two ASCII Art Extraction Methods: A Run-Length Encoding Based Method and a Byte Pattern Based Method [C]. Tetsuya Suzuki. Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control, 2015.
5. Empirical analysis of BWT-based lossless image compression [D]. Kalyan Varma Bhupathiraju. 2010.
6. Run-length encoding graphic rules biochemically editable designs and steganographical numeric data embedment for DNA-based cryptographical coding system [O]. Tomonori Kawano. 2013.
7. LZ77 Computation Based on the Run-Length Encoded BWT [O]. Alberto Policriti, Nicola Prezza. 2017.
8. Coding schemes for run-length information based on Poisson distribution [R]. W. W. Happ. 1968.
