首页> 外文会议>SIGMOD/PODS 2007 >Sketching Probabilistic Data Streams
【24h】

Sketching Probabilistic Data Streams

机译:绘制概率数据流

获取原文

摘要

The management of uncertain, probabilistic data has recently emerged as a useful paradigm for dealing with the inherent unreliabilities of several real-world application domains, including data cleaning, information integration, and pervasive, multi-sensor computing. Unlike conventional data sets, a set of probabilistic tuples de- fines a probability distribution over an exponential number of possible worlds (I.e., "grounded", deterministic databases). This "possible worlds" interpretation allows for clean query semantics but also raises hard computational problems for probabilistic database query processors. To further complicate matters, in many scenarios (e.g., large-scale process and environmental monitoring using multiple sensor modalities), probabilistic data tuples arrive and need to be processed in a streaming fashion; that is, using limited memory and CPU resources and without the benefit of multiple passes over a static probabilistic database. Such probabilistic data streams raise a host of new research challenges for stream-processing engines that, to date, remain largely unaddressed. In this paper, we propose the first space- and time-efficient algorithms for approximating complex aggregate queries (including, the number of distinct values and join/self-join sizes) over probabilistic data streams. Following the possible-worlds semantics, such aggregates essentially define probability distributions over the space of possible aggregation results, and our goal is to characterize such distributions through efficient approximations of their key moments (such as expectation and variance). Our algorithms offer strong randomized estimation guarantees while using only sublinear space in the size of the stream(s), and rely on novel, concise streaming sketch synopses that extend conventional sketching ideas to the probabilistic streams setting. Our experimental results verify the effectiveness of our approach.
机译:最近,不确定性,概率数据的管理已成为一种有用的范例,用于处理多个实际应用程序域的内在不可靠性,包括数据清理,信息集成以及普遍的多传感器计算。与常规数据集不同,一组概率元组定义了可能世界指数形式的概率分布(即确定性数据库)。这种“可能的世界”解释允许干净的查询语义,但也给概率数据库查询处理器带来了困难的计算问题。更复杂的是,在许多情况下(例如,使用多个传感器模式进行大规模过程和环境监控),概率数据元组到达并需要以流方式进行处理;也就是说,使用有限的内存和CPU资源,并且没有通过静态概率数据库进行多次传递的好处。这样的概率数据流给流处理引擎提出了许多新的研究挑战,到目前为止,这些挑战仍未解决。在本文中,我们提出了第一种节省时间和空间的算法,用于近似概率数据流上的复杂聚合查询(包括不同值的数量和连接/自连接大小)。遵循可能世界的语义,此类聚合实质上定义了可能的聚合结果空间上的概率分布,我们的目标是通过关键时刻(例如期望和方差)的有效近似来表征此类分布。我们的算法在仅使用流大小的亚线性空间的同时,提供了强大的随机估计保证,并依赖于新颖,简洁的流草图提要,这些提要将常规的草图绘制思想扩展到了概率流设置。我们的实验结果验证了我们方法的有效性。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号