CS*: Approximate Query Processing on Big Data using Scalable Join Correlated Sample Synopsis

Abstract

Complex join queries are expensive to process on big data. Fast and accurate approximations to join queries with common aggregate functions can bring substantial benefits in fields such as data management, data mining, and machine learning. State-of-the-art methods mainly generate non-reusable samples at query time, which can be costly for big data applications. In this research, we develop a scalable sample-based synopsis, called Scalable Join Correlated Sample Synopsis (CS*), which can be pre-computed and does not rely on any index structure. CS* needs to be generated only once and can then be used to answer all future queries on the same database. It efficiently maintains join relationships between sampled tuples by means of a scalable join correlated sampling scheme and a unique numerical value called the join ratio (JR). We further introduce two novel data structures, the count trace and the join correlated histogram, to optimize the calculation of JR values in MapReduce. For query estimation, multiple unbiased estimators are developed on CS* to provide fast and accurate approximations for join queries with common aggregate functions, acyclic or cyclic join graphs, and dangling tuples. An experimental study on large datasets demonstrates that CS* can be generated efficiently and provides accurate join query estimations with small sampling fractions.

Bibliographic details

Source: IEEE International Conference on Big Data | 2019 | 769p | 10 pages
Authors: Feng Yu; Wen-Chi Hou
Original format: PDF
Chinese Library Classification: Basic engineering sciences
Keywords: Estimation; Big Data; Histograms; Indexes; Query processing; Aggregates
