首页> 外文会议>International conference on very large data bases;VLDB 2008 >Approximate Lineage for Probabilistic Databases
【24h】

Approximate Lineage for Probabilistic Databases

机译:概率数据库的近似谱系

获取原文

摘要

In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems s.a. Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, which is a much smaller formula keeping track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage and prove formally their error bounds, and validate our approach experimentally on a real data set.
机译:在概率数据库中,沿袭是查询处理和理解数据的基础。当前系统Trio或Mystiq使用一种完整的方法,其中元组t的谱系是一个布尔公式,表示t的所有派生。在大型数据库中,谱系公式会变得非常庞大:在一个公共数据库(基因本体论)中,我们经常观察到单个元组的10MB谱系(来源)数据。在本文中,我们建议使用近似谱系,该谱系是一个较小的公式,仅跟踪最重要的派生,系统可以使用这些派生来处理查询并提供解释。我们详细讨论了两种特定的近似谱系:(1)称为足够谱系的保守近似,它记录了每个元组的最重要派生,以及(2)多项式谱系,它更具攻击性并且可以提供更高的压缩率,并且基于布尔表达式的傅立叶近似。在本文中,我们正式定义了近似谱系,描述了计算近似谱系并正式证明其误差范围的算法,并在真实数据集上通过实验验证了我们的方法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号