Boosting Text Compression with Word-Based Statistical Encoding

Antonio Fariña; Gonzalo Navarro; José R. Paramá

Abstract

Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30–35%, they allow fast direct searching of compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with dense-coding preprocessing achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow Prediction by Partial Matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
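The two-stage idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the abstract: End-Tagged Dense Code (ETDC) stands in for the dense code, Python's `lzma` module stands in for p7zip, and the tokenizer and all function names (`tokenize`, `build_vocab`, `etdc_codeword`, `boosted_compress`) are hypothetical. It is not the authors' implementation and it does not reproduce their new suffix-free code.

```python
import lzma
import re
from collections import Counter


def tokenize(text):
    # Alternating runs of word and non-word characters; joining the
    # tokens restores the original text exactly (simplified word model).
    return re.findall(r"\w+|\W+", text)


def build_vocab(tokens):
    # Semistatic model: a first pass ranks tokens by decreasing frequency.
    counts = Counter(tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))


def etdc_codeword(rank):
    # End-Tagged Dense Code (assumed variant): the last byte of a
    # codeword is >= 128, every earlier byte is < 128.
    out = [128 + rank % 128]
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def etdc_encode(tokens, vocab):
    rank_of = {tok: i for i, tok in enumerate(vocab)}
    return b"".join(etdc_codeword(rank_of[t]) for t in tokens)


def etdc_decode(data, vocab):
    tokens, acc = [], 0
    for b in data:
        if b < 128:            # continuation byte
            acc = (acc + b + 1) * 128
        else:                  # tagged byte closes the codeword
            tokens.append(vocab[acc + b - 128])
            acc = 0
    return "".join(tokens)


def boosted_compress(text):
    # Stage 1: word-based byte-oriented encoding.
    # Stage 2: general-purpose compressor over the byte codes.
    tokens = tokenize(text)
    vocab = build_vocab(tokens)
    return vocab, lzma.compress(etdc_encode(tokens, vocab))
```

In this sketch the vocabulary is returned separately and is not counted in the compressed output; a real compressor would have to store it alongside the data, so sizes measured this way are not comparable to the ratios quoted in the abstract.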

Bibliographic details

Source: Computer Journal, The | 2012, Issue 1 | pp. 111-131 | 21 pages
Authors: Antonio Fariña; Gonzalo Navarro; José R. Paramá
Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English

Similar literature

1. Boosting Text Compression with Word-Based Statistical Encoding [J]. Antonio Farina, Gonzalo Navarro, Jose R. Parama. The Computer Journal. 2012, Issue 1.
2. Application of a Word-Based Text Compression Method to Japanese and Chinese Texts [J]. Shigeru YOSHIDA, Takashi MORIHARA, Hironori YAHAGI. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. 2002, Issue 12.
3. Multi-Stream Word-Based Compression Algorithm for Compressed Text Search [J]. Ozturk Emir, Mesut Altan, Diri Banu. Arabian Journal for Science and Engineering. 2018, Issue 12.
4. Word-based Statistical Compressors as Natural Language Compression Boosters [C]. Antonio Farina, Gonzalo Navarro, Jose R. Parama. Data Compression Conference. 2008.
5. Experimental and Computational Investigation of Spark Assisted Compression Ignition Combustion under Boosted, Ultra EGR-Dilute Conditions [D]. Triantopoulos, Vasileios. 2018.
6. Boosting Throughput and Efficiency of Hardware Spiking Neural Accelerators Using Time Compression Supporting Multiple Spike Codes [O]. Changqing Xu, Wenrui Zhang, Yu Liu. 2020.
7. Boosting Text Compression with Word-based Statistical Encoding [O]. Antonio Fariña, Gonzalo Navarro, José R. Paramá. 2012.