International Journal of Engineering Science and Technology

APPROXIMATION TECHNIQUES FOR EXECUTION OF AGGREGATE QUERIES ON BIG DATA

Abstract

Many enterprises face the Big Data challenge, in which huge volumes of data are generated at high velocity. This Big Data can influence the future directions of the enterprise, so it is important to perform knowledge mining on Big Data effectively and efficiently. Most Big Data is generated in unstructured form, and structuring it is extremely difficult because of the high velocity of data generation. As a consequence, most analytical queries have to be executed over unstructured data. Knowledge mining involves the execution of analytical queries, most of which involve aggregate operations such as sum and variance. Executing analytical queries on Big Data can cause query execution time to explode because of the huge data volume and the unstructured data format. In many situations a large number of analytical queries need to be executed for effective knowledge mining, and this exercise can become infeasible. The obvious, and one of the most attractive, solutions to this bottleneck is approximate query processing, in which approximate answers to aggregate operations are produced through sampling within a limited execution time. However, approximate query processing on Big Data faces significant challenges. Since the data is stored in an unstructured format, random access to it may be difficult because of possible dependencies between tuples stored in the same data block, and because of the absence of efficient random-access mechanisms such as indexes. Many sampling schemes also require the total number of tuples in the data repository, which is again difficult to compute because of the unstructured nature of the stored data. Until now, this problem has not been effectively addressed in the literature. In this work, a new sampling scheme is proposed for approximate aggregate query execution on Big Data using the MapReduce model, which overcomes all of the above bottlenecks. A randomization framework is proposed that provides the facility to perform random data access. A new estimation scheme is proposed that estimates the number of tuples in the data repository with good accuracy. A sampling scheme based on the Hoeffding inequality is proposed for approximate execution of aggregate operations. The proposed sampling scheme is empirically evaluated on a real-world Big Data set; it provides approximate answers to aggregate operations with good accuracy and demonstrates appreciable execution efficiency.
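The sample-size bound behind such a scheme follows from the Hoeffding inequality: for values bounded in [a, b], a uniform random sample of n >= (b - a)^2 * ln(2/δ) / (2 * ε^2) tuples keeps the sample mean within ε of the true mean with probability at least 1 - δ. The following minimal Python sketch shows how that bound translates into a sample size and into approximate AVG and SUM answers; the value range, ε, δ, and the in-memory list standing in for the unstructured data blocks are illustrative assumptions, not parameters taken from the paper.

import math
import random

def hoeffding_sample_size(value_range, epsilon, delta):
    # Two-sided Hoeffding bound: P(|sample_mean - true_mean| >= epsilon)
    # <= 2 * exp(-2 * n * epsilon^2 / (hi - lo)^2), solved for n.
    lo, hi = value_range
    return math.ceil(((hi - lo) ** 2) * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def approximate_aggregates(sample, estimated_tuple_count):
    # AVG is the sample mean; SUM is the sample mean scaled by the
    # (possibly estimated) number of tuples in the repository.
    avg = sum(sample) / len(sample)
    return avg, avg * estimated_tuple_count

if __name__ == "__main__":
    random.seed(0)
    # Stand-in for the unstructured data repository: hypothetical values in [0, 100].
    tuples = [random.uniform(0.0, 100.0) for _ in range(1_000_000)]

    n = hoeffding_sample_size(value_range=(0.0, 100.0), epsilon=0.5, delta=0.05)
    sample = random.sample(tuples, n)
    avg, total = approximate_aggregates(sample, estimated_tuple_count=len(tuples))
    print(f"sample size: {n}, approx AVG: {avg:.2f}, approx SUM: {total:,.0f}")

With the illustrative settings above the bound gives roughly 74,000 samples regardless of how large the repository is, which is what makes sampling-based approximation attractive at Big Data scale.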
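The paper's randomization framework and tuple-count estimation scheme are not reproduced here. As a rough illustration of the MapReduce flavor of the approach, the sketch below runs a map-style pass that reservoir-samples each data block and counts its tuples, then a reduce-style combination that weights each block's sample mean by its tuple count to form a stratified SUM estimate; the block layout, the per-block sample size k, and all helper names are hypothetical.

import random

def map_block(block, k, rng):
    # Split-local (map-side) pass: keep a size-k reservoir sample of the
    # block's tuples and report the block's tuple count.
    reservoir, count = [], 0
    for value in block:
        count += 1
        if len(reservoir) < k:
            reservoir.append(value)
        else:
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = value
    return reservoir, count

def reduce_partials(partials):
    # Reduce-side combination: each block contributes its sample mean
    # weighted by its tuple count, giving a stratified SUM estimate.
    total_count = sum(count for _, count in partials)
    approx_sum = sum((sum(s) / len(s)) * count for s, count in partials)
    return approx_sum, total_count

if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical repository split into 50 blocks of varying size.
    blocks = [[rng.uniform(0.0, 100.0) for _ in range(rng.randint(5_000, 20_000))]
              for _ in range(50)]

    partials = [map_block(b, k=200, rng=rng) for b in blocks]
    approx_sum, total = reduce_partials(partials)
    exact_sum = sum(sum(b) for b in blocks)
    print(f"tuples: {total:,}, approx SUM: {approx_sum:,.0f}, exact SUM: {exact_sum:,.0f}")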
