International Conference on Big Data Innovations and Applications

Practical String Dictionary Compression Using String Dictionary Encoding

Abstract

A string dictionary is a data structure for storing a set of strings that maps them to unique IDs. It can manage string data in compact space by encoding them into integers. However, instances have recently emerged in practice where the size of string dictionaries has become a critical problem for very large datasets in many applications. A number of compressed string dictionaries have been proposed as a solution. In particular, the application of Re-Pair, a powerful text compression technique, to tries and front coding can help to obtain compact string dictionaries that support fast dictionary operations. However, the cost of constructing such dictionaries using Re-Pair is impractical for large datasets. In this paper, we propose an alternative compression strategy using string dictionary encoding and develop several dictionary structures for it. We show that our string dictionaries can be constructed up to 422.5× faster than the Re-Pair versions with competitive space and operation speed, through experiments on real-world datasets.
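To make the two dictionary operations concrete, the following is a minimal sketch of a front-coded string dictionary, one of the baseline structures the abstract mentions. It is not the encoding proposed in the paper, and it does not apply Re-Pair. The class name `FrontCodedDictionary`, the `bucket_size` parameter, and the method names `lookup` and `access` are assumptions chosen for illustration. IDs are taken to be ranks in sorted order.

```python
from bisect import bisect_right


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest common prefix of two byte strings."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class FrontCodedDictionary:
    """Toy string dictionary (hypothetical names); ID = rank in sorted order."""

    def __init__(self, strings, bucket_size=4):
        keys = sorted({s.encode("utf-8") for s in strings})
        self.size = len(keys)
        self.bucket_size = bucket_size
        self.headers = []  # first string of each bucket, stored in full
        self.buckets = []  # per bucket: list of (shared_prefix_len, suffix)
        for start in range(0, len(keys), bucket_size):
            chunk = keys[start:start + bucket_size]
            self.headers.append(chunk[0])
            coded = []
            for prev, cur in zip(chunk, chunk[1:]):
                lcp = common_prefix_len(prev, cur)
                coded.append((lcp, cur[lcp:]))  # keep only the new suffix
            self.buckets.append(coded)

    def _decode_bucket(self, b):
        """Yield the strings of bucket b in order, rebuilding each from its predecessor."""
        cur = self.headers[b]
        yield cur
        for lcp, suffix in self.buckets[b]:
            cur = cur[:lcp] + suffix
            yield cur

    def access(self, string_id: int) -> str:
        """Map an ID back to its string."""
        if not 0 <= string_id < self.size:
            raise IndexError(string_id)
        b, offset = divmod(string_id, self.bucket_size)
        for i, s in enumerate(self._decode_bucket(b)):
            if i == offset:
                return s.decode("utf-8")

    def lookup(self, string: str):
        """Map a string to its ID, or None if it is not stored."""
        key = string.encode("utf-8")
        b = bisect_right(self.headers, key) - 1  # last bucket whose header <= key
        if b < 0:
            return None
        for i, s in enumerate(self._decode_bucket(b)):
            if s == key:
                return b * self.bucket_size + i
        return None
```

For example, building the dictionary from `["tea", "team", "ten", "tent", "to", "toy"]` assigns IDs 0 through 5 in sorted order, so `lookup("ten")` returns 2 and `access(5)` returns `"toy"`. Smaller buckets make both operations faster at the cost of storing more full-length header strings.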