Venue: International Conference on Very Large Data Bases

Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Abstract

Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%-10% relative error, whereas previous methods typically incur 50%-250% relative error. Next, we show how distinct sampling can provide fast, highly-accurate approximate answers for "report" queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample provides approximate answers typically to within 0%-10% relative error, while speeding up report generation by 2-4 orders of magnitude.
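
The abstract does not spell out how a Distinct Sample is built, so the following is only a minimal sketch of one common way to sample over distinct values in a single scan: level-based hash sampling. Each value is hashed to a level, a value is kept only if its level is at least the current threshold, and the threshold rises whenever the sample outgrows its budget. The names `DistinctSample`, `level`, `capacity`, and `estimate` are hypothetical, and the sketch omits deletions and the per-row bookkeeping a full implementation would need; it should not be read as the paper's exact algorithm.

```python
import hashlib


def level(value, max_level=64):
    """Map a value to a level; P(level >= l) is about 2**-l (assumed scheme)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return max_level
    # Number of trailing zero bits of the 64-bit hash.
    return (h & -h).bit_length() - 1


class DistinctSample:
    """Bounded-size sample over the distinct values seen in one scan."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.current_level = 0
        self.sample = {}  # distinct value -> its level

    def insert(self, value):
        lv = level(value)
        # Duplicates hash to the same level, so they are either already
        # in the sample or rejected again; no per-duplicate state is kept.
        if lv < self.current_level or value in self.sample:
            return
        self.sample[value] = lv
        # Over budget: raise the threshold and evict values below it.
        while len(self.sample) > self.capacity:
            self.current_level += 1
            self.sample = {
                v: l for v, l in self.sample.items()
                if l >= self.current_level
            }

    def estimate(self, predicate=lambda v: True):
        """Estimate the number of distinct values satisfying `predicate`."""
        matching = sum(1 for v in self.sample if predicate(v))
        # Each retained value stands for about 2**current_level distinct values.
        return matching * 2 ** self.current_level
```

A usage example under the same assumptions: after `ds = DistinctSample(512)` and `ds.insert(v)` for every value in the scan, `ds.estimate()` approximates the total number of distinct values, and `ds.estimate(lambda v: 100 <= v < 200)` approximates the number of distinct values in that range. Because the sample is drawn over distinct values rather than over rows, heavily duplicated values do not crowd out rare ones, and the same stored sample can serve predicates chosen at query time.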
