Many enterprises face the Big Data challenge: they generate huge volumes of data at high velocity. This data can influence the future direction of the enterprise, so it is important to mine knowledge from it effectively and efficiently. Most Big Data is generated in unstructured form, and the high velocity of data generation makes structuring it extremely difficult. As a result, most analytical queries have to be executed over unstructured data. Knowledge mining involves executing analytical queries, and most of these involve aggregate operations such as sum and variance. On Big Data, the huge data volume and unstructured format can cause query execution time to explode. In many situations a large number of analytical queries must be executed for effective knowledge mining, and doing so can become infeasible. An obvious and attractive way to overcome this bottleneck is approximate query processing, in which approximate answers to aggregate operations are computed through sampling within a limited execution time. However, approximate query processing on Big Data poses significant challenges. Because the data is stored in unstructured form, random access may be difficult, both because tuples stored in the same data block may depend on one another and because efficient random access mechanisms such as indexes are absent. In many cases, sampling schemes also require the total number of tuples in the data repository, which is again difficult to calculate because the stored data is unstructured. Until now, this problem has not been effectively addressed in the literature.
In this work, a new sampling scheme is proposed for approximate aggregate query execution on Big Data using the MapReduce model, which overcomes all of the above-mentioned bottlenecks. A randomization framework is proposed that makes random data access possible. A new scheme is proposed for estimating the number of tuples in the data repository with good accuracy. A sampling scheme based on Hoeffding's inequality is proposed for the approximate execution of aggregate operations. The proposed sampling scheme is evaluated empirically on a real-world Big Data set. It proves effective at providing approximate answers to aggregate operations with good accuracy, and it demonstrates appreciable execution efficiency.
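To illustrate how Hoeffding's inequality can bound the sample size for an approximate aggregate, the following is a minimal Python sketch, not the scheme proposed in this work. It assumes the aggregated values lie in a known interval [lower, upper], that tuples can be drawn uniformly at random with replacement, and that an estimate of the tuple count is already available. The function names, the parameters `epsilon` and `delta`, and the in-memory list standing in for the data repository are all hypothetical.

```python
import math
import random


def hoeffding_sample_size(value_range, epsilon, delta):
    """Smallest n such that the mean of n i.i.d. samples is within epsilon
    of the true mean with probability at least 1 - delta, for values
    confined to an interval of width value_range (Hoeffding's inequality)."""
    if value_range <= 0 or epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need value_range > 0, epsilon > 0 and 0 < delta < 1")
    return math.ceil(value_range ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))


def approximate_sum(values, lower, upper, epsilon, delta, estimated_count, seed=None):
    """Approximate SUM as (estimated tuple count) * (sample mean).

    `values` is an in-memory list used here as a stand-in for the data
    repository; `estimated_count` is assumed to come from a separate
    tuple-count estimator."""
    rng = random.Random(seed)
    n = hoeffding_sample_size(upper - lower, epsilon, delta)
    # Uniform sampling with replacement keeps the draws independent,
    # which is what Hoeffding's inequality requires.
    sample = [rng.choice(values) for _ in range(n)]
    sample_mean = sum(sample) / n
    return estimated_count * sample_mean
```

Under these assumptions the required sample size depends only on the value range, the error tolerance, and the confidence level, not on the size of the repository. The error bound on the sum is the bound on the mean scaled by the tuple count, so any error in the count estimate adds to it.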