Correlation-aware statistical methods for sampling-based group by estimates.

Abstract

Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have gained much interest from industry because of the need to process analytical queries for business intelligence and decision support. A typical analytical query may require a long evaluation time because analytical queries are complicated and because the datasets used to evaluate them are large. One key problem arising from long evaluation time is that no feedback is given until the query is fully evaluated. This is problematic for two reasons. First, it makes query debugging very difficult. Second, the long running time discourages users from exploring the data interactively.

One way to speed up evaluation is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant hole in the state of the art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus, from a statistical point of view, providing estimation information for a GROUP BY query without considering the correlations is problematic and probably misleading.

In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state of the art in this area. First, I formally characterize the correlations among the groupwise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally, I develop correlation-aware statistical methods to return all "top-k" groups with high probability when only database samples are available.
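
The claim that a shared sample induces correlations among the groupwise estimates can be illustrated with a small simulation. The sketch below is not the thesis's method; it is a minimal illustration under assumed conditions: a synthetic table of (group, value) rows, uniform sampling without replacement, and a scaled-up SUM estimator. All names (`make_table`, `estimate_group_sums`, `simulate_correlation`) and all sizes are hypothetical.

```python
import random


def make_table(num_rows, groups, rng):
    """Build a synthetic table of (group, value) rows with positive values."""
    return [(rng.choice(groups), rng.uniform(1.0, 10.0)) for _ in range(num_rows)]


def estimate_group_sums(table, sample_size, rng):
    """Estimate SUM(value) GROUP BY group from ONE uniform sample.

    Every group's estimate is computed from the same sample drawn
    without replacement, then scaled by N / n.
    """
    scale = len(table) / sample_size
    sums = {}
    for group, value in rng.sample(table, sample_size):
        sums[group] = sums.get(group, 0.0) + value
    return {group: scale * total for group, total in sums.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_correlation(trials=2000, seed=0):
    """Repeat the sampling and correlate the estimates for groups A and B."""
    rng = random.Random(seed)
    table = make_table(3000, ["A", "B", "C"], rng)
    est_a, est_b = [], []
    for _ in range(trials):
        estimates = estimate_group_sums(table, 300, rng)
        est_a.append(estimates.get("A", 0.0))
        est_b.append(estimates.get("B", 0.0))
    return pearson(est_a, est_b)


if __name__ == "__main__":
    print(f"correlation between group A and B estimates: {simulate_correlation():.3f}")
```

In this setup the two estimates come out negatively correlated: the sample has a fixed size, so rows that land in one group are rows that did not land in another. Confidence bounds computed per group as if the estimates were independent would ignore this dependence.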

Bibliographic record

Author: Xu, Fei
Author affiliation: University of Florida
Degree-granting institution: University of Florida
Discipline: Computer Science
Degree: Ph.D.
Year: 2009
Pages: 139 p.
Format: PDF
Language: English
