Correlation-aware statistical methods for sampling-based group by estimates.

Abstract

Over the last decade, Data Warehousing and Online Analytical Processing (OLAP) have attracted considerable interest from industry because of the need to process analytical queries for business intelligence and decision support. A typical analytical query may take a long time to evaluate, both because such queries are complicated and because the datasets they run over are large. One key problem arising from long evaluation times is that no feedback is given until the query has been fully evaluated. This is problematic for several reasons. First, it makes query debugging very difficult. Second, the long running time discourages users from exploring the data interactively. One way to reduce evaluation time is to use approximate query processing techniques, such as sampling. Researchers have developed scalable approximate query processing techniques for SELECT-PROJECT-JOIN-AGGREGATE queries. However, most work has ignored GROUP BY queries. This is a significant gap in the state of the art, since the GROUP BY query is an important type of OLAP query. For example, more than two thirds of the public TPC-H benchmark queries are GROUP BY queries. Running a GROUP BY query in an approximate query processing system requires the same sample to be used to estimate the result of each group, which induces correlations among the estimates. Thus, from a statistical point of view, providing estimation information for a GROUP BY query without considering these correlations is problematic and probably misleading. In this thesis, I formally address this problem and provide correlation-aware statistical methods to answer sampling-based GROUP BY queries. I make three specific contributions to the state of the art in this area. First, I formally characterize the correlations among the groupwise estimates. Second, I develop methods to provide correlation-aware simultaneous confidence bounds for GROUP BY queries. Finally, I develop correlation-aware statistical methods to return all "top-k" groups with high probability when only database samples are available.

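Bibliographic record

  • Author: Xu, Fei
  • Author affiliation: University of Florida
  • Degree-granting institution: University of Florida
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 139 p.
  • Total pages: 139
  • Original format: PDF
  • Language: English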
