Data Compression Conference

Word-based Statistical Compressors as Natural Language Compression Boosters

Abstract

Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on the uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step for classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With a Dense coding preprocessing step, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
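The preprocessing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an End-Tagged Dense Code style byte code (the abstract only says "Dense-Code-based"), a simplified tokenizer that treats runs of word and non-word characters as symbols, and Python's standard `bz2` module as a stand-in for bzip2. The function names (`tokenize`, `encode_rank`, `decode_ranks`, `boosted_compress`, `boosted_decompress`) are hypothetical, and the vocabulary is kept in memory rather than serialized.

```python
import bz2
import re
from collections import Counter


def tokenize(text):
    # Simplification (assumption): alternate runs of word / non-word characters.
    # Concatenating the tokens reproduces the text exactly.
    return re.findall(r"\w+|\W+", text)


def encode_rank(rank):
    # Dense byte code: the last byte has its high bit set (128..255),
    # every preceding byte is in 0..127. Lower ranks get shorter codewords.
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def decode_ranks(payload):
    # Inverse of encode_rank applied to a concatenation of codewords.
    ranks, acc = [], 0
    for b in payload:
        if b >= 128:            # high bit set: codeword ends here
            ranks.append(acc * 128 + (b - 128))
            acc = 0
        else:                   # continuation byte
            acc = acc * 128 + b + 1
    return ranks


def boosted_compress(text):
    # Pass 1 (semistatic model): rank symbols by decreasing frequency.
    tokens = tokenize(text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank_of = {tok: i for i, tok in enumerate(vocab)}
    # Pass 2: replace each symbol by its byte-oriented codeword.
    payload = b"".join(encode_rank(rank_of[t]) for t in tokens)
    # Boosting step: feed the word-coded bytes to a classical compressor.
    return vocab, bz2.compress(payload)


def boosted_decompress(vocab, blob):
    ranks = decode_ranks(bz2.decompress(blob))
    return "".join(vocab[r] for r in ranks)
```

A round trip such as `boosted_decompress(*boosted_compress(text)) == text` should hold for any string. A real system would also need to store the vocabulary alongside the compressed payload, which this sketch leaves out.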
