High-order entropy-compressed text indexes

Abstract

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet σ, where each symbol is encoded by lg |σ| bits. We show that compressed suffix arrays use just nH_h + σ bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |σ| + polylog(n)) time. The term H_h ≤ lg |σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
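
The abstract relies on the hth-order empirical entropy H_h without defining it. The sketch below uses the standard textbook definition, which is an assumption here because the abstract does not spell it out: group each symbol by the h symbols that precede it, take the zeroth-order entropy of each group, and average the groups weighted by their size. The function names `h0` and `empirical_entropy` are illustrative and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((c / n) * log2(n / c) for c in counts.values())


def empirical_entropy(text, h):
    """h-th order empirical entropy H_h of text, in bits per symbol.

    Assumed definition: H_h = (1/n) * sum over contexts w of length h
    of |T_w| * H_0(T_w), where T_w collects the symbols that follow w.
    """
    n = len(text)
    if n == 0:
        return 0.0
    if h == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(h, n):
        # The h symbols before position i form the context of text[i].
        followers[text[i - h:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n
```

For the text "abababab", H_0 is 1 bit per symbol while H_1 is 0, because each symbol is fully determined by its predecessor. Texts like this, where H_h is far below lg |σ|, are the highly compressible case in which an nH_h-bit index is much smaller than the n lg |σ| bits of the raw text.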
