Journal: Information Retrieval

Lightweight natural language text compression

Abstract

Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve the compression ratio of Tagged Huffman Code by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
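
The abstract names End-Tagged Dense Code without spelling out the encoding. The following is a minimal Python sketch of the scheme as it is commonly described, not the authors' implementation: words are ranked by decreasing frequency, and the codeword of a word is computed directly from its rank, with the high bit set only on the last byte of each codeword. The function names (etdc_encode, etdc_decode, compress, decompress), the whitespace tokenization, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank` (assumed scheme)."""
    out = [128 + rank % 128]        # last byte: high bit set marks the codeword end
    rank //= 128
    while rank > 0:
        rank -= 1                   # "dense": every shorter-length code is already used
        out.append(rank % 128)      # earlier bytes keep the high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def compress(words):
    """Semistatic model: one pass to count words, one pass to encode them."""
    counts = Counter(words)
    # Most frequent words get the smallest ranks, hence the shortest codewords.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)


def decompress(vocab, data):
    """Split the byte stream at bytes >= 128, which end a codeword."""
    words, start = [], 0
    for i, b in enumerate(data):
        if b >= 128:
            words.append(vocab[etdc_decode(data[start:i + 1])])
            start = i + 1
    return words


if __name__ == "__main__":
    text = "to be or not to be that is the question".split()
    vocab, data = compress(text)
    print(len(data), "bytes for", len(text), "words")
    print(decompress(vocab, data) == text)
```

Because the end of every codeword is visible from a single bit, a decoder can resynchronize at any position in the compressed stream, which is consistent with the direct search and random access properties the abstract claims. In this sketch no code tree is built; the codeword follows from the rank alone. (s, c)-Dense Code generalizes the fixed 128/128 split between ending and non-ending byte values; it is not shown here.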
