Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports


Abstract

Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%-10% relative error, whereas previous methods typically incur 50%-250% relative error. Next, we show how distinct sampling can provide fast, highly-accurate approximate answers for "report" queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample provides approximate answers typically to within 0%-10% relative error, while speeding up report generation by 2-4 orders of magnitude.
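The single-scan collection described in the abstract can be illustrated with a level-based sketch. This is a simplified illustration under stated assumptions, not the paper's exact procedure: it handles insertions only, keeps just a per-value occurrence count, and the names `die_level`, `DistinctSample`, and `max_values` are hypothetical. Each value is hashed to a level so that roughly a 2^-l fraction of distinct values reach level l or higher; the sample keeps only values at or above the current level, and raises the level whenever the space bound is exceeded.

```python
import hashlib


def die_level(value, max_level=63):
    """Map a value to a level; level >= l occurs with probability about 2**-l.

    Assumption: a 64-bit BLAKE2b digest of repr(value) stands in for the
    random hash function. The level is the number of trailing zero bits.
    """
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    level = 0
    while level < max_level and (h >> level) & 1 == 0:
        level += 1
    return level


class DistinctSample:
    """Bounded-size sample over the distinct values of a stream (insert-only)."""

    def __init__(self, max_values=1024):
        self.max_values = max_values  # space bound: distinct values retained
        self.level = 0                # current sampling level
        self.counts = {}              # sampled value -> occurrences seen

    def insert(self, value):
        if value in self.counts:
            self.counts[value] += 1
            return
        if die_level(value) < self.level:
            return  # value is not part of the sample at the current level
        self.counts[value] = 1
        # Over budget: raise the level and evict values that fall below it.
        while len(self.counts) > self.max_values:
            self.level += 1
            self.counts = {v: c for v, c in self.counts.items()
                           if die_level(v) >= self.level}

    def estimate(self, predicate=lambda v: True):
        """Estimate the number of distinct values satisfying `predicate`."""
        kept = sum(1 for v in self.counts if predicate(v))
        return kept * 2 ** self.level
```

Because a value's level never changes, a value is either in the sample from its first occurrence or never enters it, so the stored counts are exact for every retained value. Passing a predicate to `estimate` scales the number of matching sampled values by the same factor, which mirrors the abstract's claim that a stored sample can answer distinct-values queries over a range or other subset chosen at query time.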


Bibliographic details

Source: Twenty-Seventh International Conference on Very Large Data Bases, Sep 11-14, 2001, Roma, Italy | 2001 | pp. 541-550 | 10 pages
Conference location: Roma (IT)
Author: Phillip B. Gibbons
Author affiliation: Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974
Original format: PDF
Text language: English
Chinese Library Classification: Automation technology, computer technology

Similar literature

1. Comparing Three Distinct Samples on Traumatic Events, Post Traumatic Stress Disorder and Dysfunctional Coping Styles [J]. Gary Blau, Glen Miller. Journal of Educational and Developmental Psychology. 2021, No. 1
2. The number of distinct values in a geometrically distributed sample [J]. Margaret Archibald, Arnold Knopfmacher, Helmut Prodinger. European Journal of Combinatorics. 2006, No. 7
3. In silico analysis of bacterial translation factors reveal distinct translation event specific pI values [J]. Soma Jana, Partha P. Datta. BMC Genomics. 2021, No. 1
4. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports [C]. Phillip B. Gibbons. International Conference on Very Large Data Bases. 2001
5. Do different types of negative events lead to distinct adaptive functioning threats? [D]. Mansfield, Cade D. 2015
6. In silico analysis of bacterial translation factors reveal distinct translation event specific pI values [O]. Soma Jana, Partha P. Datta. 2021
7. Comparing Three Distinct Samples on Traumatic Events, Post Traumatic Stress Disorder and Dysfunctional Coping Styles [O]. Gary Blau, Glen Miller. 2021
