2017 International Conference on Big Data Innovations and Applications

Practical String Dictionary Compression Using String Dictionary Encoding

Abstract

A string dictionary is a data structure for storing a set of strings that maps them to unique IDs. It can manage string data in compact space by encoding them into integers. However, instances have recently emerged in practice where the size of string dictionaries has become a critical problem for very large datasets in many applications. A number of compressed string dictionaries have been proposed as a solution. In particular, the application of Re-Pair, a powerful text compression technique, to tries and front coding can help to obtain compact string dictionaries that support fast dictionary operations. However, the cost of constructing such dictionaries using Re-Pair is impractical for large datasets. In this paper, we propose an alternative compression strategy using string dictionary encoding and develop several dictionary structures for it. We show that our string dictionaries can be constructed up to 422.5× faster than the Re-Pair versions with competitive space and operation speed, through experiments on real-world datasets.
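
To make the two dictionary operations concrete, the following is a minimal sketch of a plain front-coded string dictionary, one of the baseline structures the abstract mentions. It is not the structure proposed in the paper and applies neither Re-Pair nor string dictionary encoding. The class name `FrontCodedDictionary`, the methods `lookup` and `access`, and the `bucket_size` parameter are hypothetical names chosen for illustration, and IDs are assumed to be ranks in byte-wise sorted order.

```python
from bisect import bisect_right


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class FrontCodedDictionary:
    """Illustrative front-coded dictionary (hypothetical API, not the paper's)."""

    def __init__(self, strings, bucket_size=4):
        # Deduplicate and sort byte-wise; a string's ID is its rank in this order.
        keys = sorted({s.encode("utf-8") for s in strings})
        self.size = len(keys)
        self.bucket_size = bucket_size
        self.headers = []  # first string of each bucket, stored in full
        self.buckets = []  # remaining strings as (shared prefix length, suffix)
        for start in range(0, len(keys), bucket_size):
            chunk = keys[start:start + bucket_size]
            self.headers.append(chunk[0])
            entries = []
            for prev, cur in zip(chunk, chunk[1:]):
                lcp = common_prefix_len(prev, cur)
                entries.append((lcp, cur[lcp:]))
            self.buckets.append(entries)

    def _decode_bucket(self, b):
        # Rebuild each string from its predecessor within the bucket.
        cur = self.headers[b]
        yield cur
        for lcp, suffix in self.buckets[b]:
            cur = cur[:lcp] + suffix
            yield cur

    def lookup(self, s: str) -> int:
        """Return the ID of s, or -1 if s is not stored."""
        key = s.encode("utf-8")
        # Binary search over bucket headers, then a sequential scan of one bucket.
        b = bisect_right(self.headers, key) - 1
        if b < 0:
            return -1
        for offset, candidate in enumerate(self._decode_bucket(b)):
            if candidate == key:
                return b * self.bucket_size + offset
        return -1

    def access(self, i: int) -> str:
        """Return the string whose ID is i."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        b, offset = divmod(i, self.bucket_size)
        for k, candidate in enumerate(self._decode_bucket(b)):
            if k == offset:
                return candidate.decode("utf-8")
        raise IndexError(i)  # unreachable for a valid i
```

In this sketch a larger `bucket_size` stores fewer full strings and so saves space, at the cost of a longer sequential scan per operation. The compressed variants described in the abstract would further compress the stored suffixes, which this example leaves as raw bytes.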
