CS*: Approximate Query Processing on Big Data using Scalable Join Correlated Sample Synopsis

Conference: IEEE International Conference on Big Data


Abstract

Complex join queries are expensive to process on big data. Providing fast and accurate approximations to join queries with common aggregate functions can bring tremendous benefits in many fields such as data management, data mining, and machine learning. State-of-the-art methods mainly focus on generating non-reusable samples at query time, which can be costly for big data applications. In this research, we develop a scalable sample-based synopsis, called Scalable Join Correlated Sample Synopsis (or CS*), which can be pre-computed and does not rely on any index structure. CS* needs to be generated only once and can be used to answer all future queries on the same database. It efficiently maintains join relationships between sampled tuples by means of a scalable join correlated sampling scheme and a unique numerical value called the join ratio (or JR). We further introduce two novel data structures, namely count trace and join correlated histogram, to optimize the calculation of JR values in MapReduce. For query estimation, multiple unbiased estimators are developed on CS* to provide fast and accurate approximations for join queries with common aggregate functions, acyclic or cyclic join graphs, and dangling tuples. An experimental study on large datasets demonstrates that CS* can be generated efficiently and provides accurate join query estimates with small sampling fractions.
