Improving semistatic compression via phrase-based modeling



Abstract

In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text, and to decompress arbitrary text passages, over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet.

In this work we go one step further by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms.
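To make the coding side of the abstract concrete, here is a minimal sketch of End-Tagged Dense Code as it is commonly described: symbols are ranked by frequency, and each codeword is a byte sequence whose last byte has its high bit set. This is an illustration under stated assumptions, not the authors' implementation. The function names (`etdc_encode`, `etdc_decode`, `compress`, `decompress`) are hypothetical, and the sketch does not cover how PETDC or PhETDC choose which pairs or phrases become symbols; it simply accepts any sequence of symbols, whether words or multi-word strings.

```python
from collections import Counter

def etdc_encode(rank: int) -> bytes:
    """Codeword for the symbol with 0-based frequency rank `rank`."""
    out = [128 + rank % 128]        # last byte: high bit set marks the end
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)      # earlier bytes: high bit clear
        rank = rank // 128 - 1
    return bytes(reversed(out))

def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode: recover the rank from one codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)

def compress(symbols):
    """Semistatic scheme: first pass ranks symbols, second pass encodes."""
    vocab = [s for s, _ in Counter(symbols).most_common()]
    rank = {s: i for i, s in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[s]) for s in symbols)

def decompress(vocab, data: bytes):
    """Split the stream at bytes >= 128 and map ranks back to symbols."""
    out, start = [], 0
    for i, b in enumerate(data):
        if b >= 128:                # end tag: the codeword finishes here
            out.append(vocab[etdc_decode(data[start:i + 1])])
            start = i + 1
    return out

# Word symbols versus an (assumed, hand-picked) pair symbol "to be".
words = "to be or not to be".split()
pairs = ["to be", "or", "not", "to be"]
vocab_w, data_w = compress(words)
vocab_p, data_p = compress(pairs)
```

With this tiny input every symbol gets a one-byte codeword, so the word-based stream takes 6 bytes and the pair-based stream 4 (vocabulary storage not counted), which hints at why longer source symbols can help. Because the end tag marks codeword boundaries, decoding can start at any codeword boundary, which is the property behind the random access and direct search mentioned above.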

Bibliographic details

  • Source
    Information Processing & Management | 2011, Issue 4 | pp. 545-559 | 15 pages
  • Author affiliations

    Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;

    Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;

    Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;

    Database Lab, Facultade de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain;

  • Indexing information
  • Original format: PDF
  • Language: English (eng)
  • Chinese Library Classification
  • Keywords

    text compression; direct search;

  • Date added to database: 2022-08-17 23:20:20
