Journal: The Computer Journal

Boosting Text Compression with Word-Based Statistical Encoding



Abstract

Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30–35%, they allow fast direct searching of the compressed text. In this article, we show that these compressors offer further benefits. Most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow prediction by partial matching (PPM) compressors. Furthermore, searches run much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
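
The preprocessing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes End-Tagged Dense Code (a known dense code in which only the last byte of each codeword has its high bit set) rather than the new suffix-free code the abstract mentions, a simple regular-expression tokenizer, and Python's `lzma` module as a stand-in for p7zip. All function names are hypothetical.

```python
import lzma
import re
from collections import Counter

# Assumption: tokens are alternating runs of word and non-word characters,
# so joining the tokens reproduces the input exactly.
TOKEN_RE = re.compile(r"\w+|\W+")


def etdc_codeword(rank: int) -> bytes:
    """End-Tagged Dense Code for a 0-based frequency rank.

    The last byte is in 128..255 (tagged); all earlier bytes are in 0..127.
    """
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def encode(text: str):
    """Semistatic two-pass encoding: count tokens, then emit byte codewords."""
    tokens = TOKEN_RE.findall(text)
    # Pass 1: most frequent tokens get the lowest ranks (shortest codewords).
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank = {tok: i for i, tok in enumerate(vocab)}
    # Pass 2: replace every token by its codeword.
    payload = b"".join(etdc_codeword(rank[tok]) for tok in tokens)
    return vocab, payload


def decode(vocab, payload: bytes) -> str:
    out, acc = [], 0
    for b in payload:
        if b < 128:                      # continuation byte
            acc = acc * 128 + b + 1
        else:                            # tagged byte closes the codeword
            out.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return "".join(out)


def count_word(vocab, payload: bytes, word: str) -> int:
    """Count occurrences of a token by scanning the byte-coded text directly."""
    if word not in vocab:
        return 0
    cw = etdc_codeword(vocab.index(word))
    count, pos = 0, payload.find(cw)
    while pos != -1:
        # A match is aligned only if the previous byte ends a codeword.
        if pos == 0 or payload[pos - 1] >= 128:
            count += 1
        pos = payload.find(cw, pos + 1)
    return count


def boosted_compress(text: str):
    """Word-based byte coding first, then a general-purpose compressor."""
    vocab, payload = encode(text)
    return vocab, lzma.compress(payload)


def boosted_decompress(vocab, blob: bytes) -> str:
    return decode(vocab, lzma.decompress(blob))
```

A real compressor would also have to serialize the vocabulary alongside the payload; the sketch keeps it in memory for brevity. Because the codewords are whole bytes, `count_word` searches the intermediate representation without decoding it, which is the property the abstract refers to as direct searching.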
