We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg |σ| bits. We show that compressed suffix arrays use just nH_h + σ bits while retaining full text indexing functionalities, such as searching for any pattern sequence of length m in O(m lg |σ| + polylog(n)) time. The term H_h ≤ lg |σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
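To make the hth-order empirical entropy concrete, the following is a minimal sketch that computes it for a short string. It assumes the common definition in which, for each length-h context, one takes the zeroth-order entropy of the symbols that follow that context, weights it by the number of such symbols, and divides the total by n. The function names `h0` and `empirical_entropy` are illustrative and do not come from the paper, and the sketch only measures entropy; it does not build a compressed suffix array.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy (bits per symbol) of a sequence."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())


def empirical_entropy(text, h):
    """hth-order empirical entropy of text, in bits per symbol.

    Assumed definition: group each symbol by the h symbols preceding it,
    sum len(group) * h0(group) over all contexts, and divide by len(text).
    """
    if h == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(len(text) - h):
        # text[i:i+h] is the context, text[i+h] is the symbol that follows it
        followers[text[i:i + h]].append(text[i + h])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


if __name__ == "__main__":
    sample = "abababab"
    print(empirical_entropy(sample, 0))
    print(empirical_entropy(sample, 1))
```

For the sample string `abababab`, the zeroth-order entropy is 1 bit per symbol, yet the first-order entropy is 0 because each symbol is fully determined by its predecessor. This is the kind of gap between H_h and lg |σ| that the space bound exploits.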