A bi-level Bernoulli scheme for database sampling

Abstract

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision; the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
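
The idea of combining page-level and row-level Bernoulli sampling can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameter names `p` (page inclusion probability) and `r` (row inclusion probability within a selected page), and the in-memory list-of-pages layout are all assumptions made for illustration. Under these assumptions each row is included with overall probability `p * r`, so a sampled sum can be scaled up by `1 / (p * r)`.

```python
import random
from typing import List, Sequence


def bilevel_bernoulli_sample(
    pages: Sequence[Sequence[float]],
    p: float,
    r: float,
    rng: random.Random,
) -> List[float]:
    """Illustrative bi-level Bernoulli sample over a table stored as pages.

    p and r are hypothetical names: p is the probability of reading a page,
    r the probability of keeping each row on a page that was read.
    """
    if not (0.0 < p <= 1.0 and 0.0 < r <= 1.0):
        raise ValueError("p and r must lie in (0, 1]")
    sample: List[float] = []
    for page in pages:
        # Page-level coin flip: skipped pages are never read at all.
        if rng.random() < p:
            for row in page:
                # Row-level coin flip within a page that was read.
                if rng.random() < r:
                    sample.append(row)
    return sample


def estimate_sum(sample: Sequence[float], p: float, r: float) -> float:
    """Scale the sampled total by the inverse inclusion probability p * r."""
    return sum(sample) / (p * r)
```

Setting `p = 1` in this sketch reduces it to plain row-level sampling, and setting `r = 1` reduces it to plain page-level sampling; intermediate settings with the same product `p * r` keep the expected sample size fixed while changing how many pages are read.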

Bibliographic details

  • Source
    2004, pp. 275-286 (12 pages)
  • Authors

    Peter J. Haas; Christian Konig

  • Original format: PDF
  • Chinese Library Classification: TP274.23
