首页> 中文期刊> 《计算机工程与应用》 >编码单位可变的倒排索引压缩算法研究

编码单位可变的倒排索引压缩算法研究

         

摘要

倒排索引是大多数大型文本搜索系统的核心数据结构,索引压缩可以有效地减少倒排索引的空间占用,提升检索效率.针对倒排索引压缩算法中的字节对齐编码进行研究,对于其压缩率不够优秀的问题,提出了分区可变单位编码(PVU编码).算法以可变单位方式代替固定字节存储,使实际存储空间更加贴合原码长度,从而提高压缩效果.针对序列均匀分区并非最优分区的问题,提出将最优分区问题转化为图论中最短路径问题的方法,使用Dijk-stra算法求解序列的最优编码分区.通过对比实验验证了改进优化的PVU编码相较于传统的字节对齐编码能够更好地压缩倒排索引序列.%The inverted index is the data structure at the core of most large-scale search systems. Index compression can effectively reduce the space occupied by inverted index and improve retrieval efficiency. To solve the problem of poor compression of byte-aligned coding, Partitioned Variable Unit(PVU)coding is proposed. Instead of storing multiple bytes, PVU stores integers in multiple units. Units include multiple values. This method makes the storage space more suitable for the original code length and has a better compression effect. To solve the optimal partition problem, the partition problem is transformed into the shortest path problem in graph theory. Dijkstra’s algorithm is designed to solve this problem. Experiments show that the optimized PVU can compress the inverted index sequence better than the traditional byte-aligned coding.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号