A bi-level Bernoulli scheme for database sampling

Abstract

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision; the appropriate choice of parameter settings then becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
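
The combination of page-level and row-level sampling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each page is kept independently with probability p and that each row on a kept page is kept independently with probability r, so every row is included with probability p*r. The function names, the list-of-lists page model, the scaled SUM estimator, and the p_max bound are all hypothetical choices made for illustration; the abstract does not give a formula for the PHI, so it is taken here as an input.

```python
import random
from typing import List, Optional, Sequence, Tuple


def bilevel_bernoulli_sample(
    pages: Sequence[Sequence[float]],
    p: float,
    r: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Keep each page with probability p, then each row on a kept page
    with probability r (assumed model, not taken from the paper)."""
    if not (0.0 <= p <= 1.0 and 0.0 <= r <= 1.0):
        raise ValueError("p and r must lie in [0, 1]")
    rng = rng or random.Random()
    sample: List[float] = []
    for page in pages:
        # Page-level coin flip; in a real system a rejected page
        # would not be read from disk at all.
        if rng.random() < p:
            for value in page:
                # Row-level coin flip on a page that was read.
                if rng.random() < r:
                    sample.append(value)
    return sample


def estimate_sum(sample: Sequence[float], p: float, r: float) -> float:
    """Scale the sample total by 1 / (p * r); unbiased for SUM because
    every row has inclusion probability p * r."""
    if p <= 0.0 or r <= 0.0:
        raise ValueError("p and r must be positive to scale the estimate")
    return sum(sample) / (p * r)


def choose_rates(q: float, phi: float, p_max: float = 1.0) -> Tuple[float, float]:
    """Hypothetical bang-bang choice of (p, r) for an overall rate q = p * r.

    phi is an externally supplied page-heterogeneity index; p_max is an
    assumed upper bound on the page rate (for example, from a time budget).
    """
    if not (0.0 < q <= p_max <= 1.0):
        raise ValueError("need 0 < q <= p_max <= 1")
    if phi < 1.0:
        # As row-like as possible: read as many pages as allowed.
        return p_max, q / p_max
    # As page-like as possible: keep every row on each sampled page.
    return q, 1.0
```

Setting p = 1 reduces the sketch to plain row-level Bernoulli sampling, and setting r = 1 reduces it to plain page-level sampling, which are the two extremes the bang-bang policy switches between.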

Bibliographic details

Source: ACM SIGMOD international conference on Management of data, 2004, pp. 275-286 (12 pages)
Authors: Peter J. Haas; Christian Konig
Original format: PDF
Chinese Library Classification: TP274.23

Similar literature

1. Enhanced sampling schemes for MCMC based blind Bernoulli-Gaussian deconvolution [J]. D. Ge, J. Idier, E. Le Carpentier. Signal processing. 2011, No. 4
2. An all-season sample database for improving land-cover mapping of Africa with two classification schemes [J]. Li Congcong, Gong Peng, Wang Jie. International journal of remote sensing. 2016, No. 19a20
3. A bi-level encoding scheme for the clustered shortest-path tree problem in multifactorial optimization [J]. Huynh Thi Thanh Binh, Ta Bao Thang, Nguyen Duc Thai. Engineering Applications of Artificial Intelligence. 2021, No. Apra
4. A bi-level Bernoulli scheme for database sampling [C]. Peter J. Haas, Christian Konig. ACM SIGMOD international conference on Management of data. 2004
5. An inverse problem for the Euler-Bernoulli equation and a new scheme for solving a hierarchical size-structured model with nonlinear growth, mortality, and reproduction rates [D]. Marinov, Tchavdar. 2008
6. The Correctness of the Simplified Bernoulli Trial (SBT) Collision Scheme of Calculations of Two-Dimensional Flows [O]. Kiril Shterev. 2021
7. A Bi-Level Bernoulli Scheme for Database Sampling [O]. Peter J. Haas. 2004
