APPROXIMATION TECHNIQUES FOR EXECUTION OF AGGREGATE QUERIES ON BIG DATA

MALATESH S HAVANUR; DR. Y S KUMARASWAMY

Abstract

Many enterprises face the Big Data challenge: they generate huge volumes of data at high velocity. This data can shape an enterprise's future direction, so it is important to mine knowledge from it effectively and efficiently. Most Big Data is generated in unstructured format, and the high velocity of data generation makes structuring it extremely difficult. As a result, most analytical queries have to be executed over unstructured data. Knowledge mining involves executing analytical queries, most of which involve aggregate operations such as sum and variance. On Big Data, the huge data volume and the unstructured format can cause query execution time to explode. In many situations a large number of analytical queries must be executed for effective knowledge mining, and this can become infeasible. One of the most attractive solutions to this bottleneck is approximate query processing, in which approximate answers to aggregate operations are obtained through sampling within a limited execution time. However, approximate query processing on Big Data poses significant challenges. Because the data is stored in unstructured format, random access to it can be difficult, owing to possible data dependencies between tuples stored in the same data block and to the absence of efficient random access mechanisms such as indexes. In many cases, sampling schemes also require the total number of tuples in the data repository, which is likewise difficult to calculate because the stored data is unstructured. Until now, this problem has not been effectively addressed in the literature.
In this work, a new sampling scheme is proposed for approximate aggregate query execution on Big Data using the MapReduce model, which overcomes all of the above-mentioned bottlenecks. A randomization framework is proposed that provides the facility to perform random data access. A new scheme is proposed for estimating the number of tuples in the data repository with good accuracy. A sampling scheme based on the Hoeffding inequality is proposed for approximate execution of aggregate operations. The proposed sampling scheme is empirically evaluated on a real-world Big Data set. It is effective at providing approximate answers to aggregate operations with good accuracy, and it demonstrates appreciable execution efficiency.
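
The abstract does not give the details of the proposed scheme, so the following is only a minimal sketch of how the textbook Hoeffding inequality can size a sample for an approximate mean; it is not the authors' method. For independent values confined to an interval of width R, the bound P(|sample mean - true mean| >= eps) <= 2 exp(-2 n eps^2 / R^2) gives n >= R^2 ln(2/delta) / (2 eps^2). The function names, the known value range, and the in-memory list standing in for a MapReduce-based random access layer are all assumptions made for illustration.

```python
import math
import random


def hoeffding_sample_size(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n for which Hoeffding's inequality guarantees
    P(|sample mean - true mean| >= epsilon) <= delta,
    assuming i.i.d. draws confined to an interval of width value_range."""
    if value_range <= 0 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need value_range > 0, epsilon > 0 and 0 < delta < 1")
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))


def approximate_mean(data, low, high, epsilon, delta, rng=random):
    """Estimate the mean of `data` from a uniform sample drawn with replacement.

    `data` is a plain in-memory sequence here (an assumption); in the setting
    described above, uniform random access would have to come from a
    randomization layer over the unstructured store.
    Returns (estimate, sample_size).
    """
    n = hoeffding_sample_size(high - low, epsilon, delta)
    sample = rng.choices(data, k=n)  # with replacement, so draws are i.i.d.
    return sum(sample) / n, n


if __name__ == "__main__":
    population = [random.random() for _ in range(1_000_000)]
    estimate, n = approximate_mean(population, 0.0, 1.0, epsilon=0.05, delta=0.05)
    print(f"sampled {n} of {len(population)} values, estimated mean {estimate:.4f}")
```

With a value range of 1, epsilon = 0.05 and delta = 0.05 the bound asks for 738 samples, regardless of how many tuples the repository holds. An approximate SUM would multiply the estimated mean by the tuple count, which is why an estimate of that count matters when it cannot be computed exactly.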

Bibliographic details

Source: International Journal of Engineering Science and Technology, 2018, Issue 1, 10 pages
Authors: MALATESH S HAVANUR; DR. Y S KUMARASWAMY
Original format: PDF
CLC classification: Industrial technology