As data volumes grow, big data management becomes increasingly important. Among popular mathematical models, the probabilistic model is well suited to big data management because it can abstract a large volume of data into a small amount of probabilistic data. Studying probabilistic data management in big data environments is therefore important. As a classic query, the range query over probabilistic data has been studied extensively. However, state-of-the-art approaches are not suitable for big data environments, mainly because their indexes incur high update costs. In this paper, we propose a novel index named HGD-Tree to solve this problem. First, we propose a group of strategies for handling newly arriving objects, so that insertion, deletion, and update can be performed efficiently while keeping the tree structure balanced. Second, we propose a novel partition-based structure to approximate the probability density function of an object; the structure self-adjusts its partition resolution to fit the characteristics of the underlying uncertain data. Moreover, the structure is represented by a few bit vectors. Together, these two properties guarantee the low space cost of the proposed index. Finally, we propose a novel range query algorithm that prunes effectively using only a few bitwise operations. Theoretical analysis and extensive experimental results demonstrate the effectiveness of the proposed algorithms.
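The idea of pruning with bit vectors can be illustrated with a minimal sketch. It is not the HGD-Tree itself: the abstract does not specify the index layout, so the sketch assumes a one-dimensional domain split into a fixed number of equal cells, whereas the proposed structure adjusts its partition resolution. The function names `summarize`, `range_mask`, and `can_prune` are hypothetical.

```python
from typing import List


def summarize(cell_mass: List[float], eps: float = 0.0) -> int:
    """Pack per-cell probability mass into an integer bit vector.

    Bit i is set when cell i holds more than eps probability mass.
    Assumption: a fixed-resolution grid, used here only for illustration.
    """
    bits = 0
    for i, mass in enumerate(cell_mass):
        if mass > eps:
            bits |= 1 << i
    return bits


def range_mask(lo: int, hi: int) -> int:
    """Return a mask with bits lo..hi (inclusive) set for a query over cells."""
    return ((1 << (hi - lo + 1)) - 1) << lo


def can_prune(summary: int, lo: int, hi: int) -> bool:
    """An object can be skipped if none of its occupied cells meets the query."""
    return (summary & range_mask(lo, hi)) == 0


# An uncertain object whose mass lies in cells 2, 3, and 4 of an 8-cell domain.
obj = summarize([0.0, 0.0, 0.3, 0.5, 0.2, 0.0, 0.0, 0.0])

print(can_prune(obj, 5, 7))  # query touches no occupied cell
print(can_prune(obj, 4, 6))  # query overlaps cell 4
```

A single AND followed by a comparison decides whether the object can contribute any probability to the query range, which is the kind of cheap filter the abstract refers to. Objects that survive this test would still need their exact probability computed.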