Cluster Computing

PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique



Abstract

Frequent itemset mining (FIM) is one of the prominent techniques for extracting knowledge from transactional databases. Finding frequent itemsets becomes burdensome on large-scale datasets due to the enormous demand for computational resources. Cluster-based versions of FIM algorithms are therefore a natural choice for making FIM efficient and scalable on large-scale datasets and for meeting ever-growing data needs. A variety of MapReduce-based algorithms have been developed on Hadoop clusters to meet these expectations, but owing to the iterative nature of FIM algorithms, they still perform inadequately. Bottlenecks associated with MapReduce-based FIMs include the challenges of adapting FIM algorithms to parallel and distributed contexts, as well as reading and writing intermediate results to HDFS, which incurs significant disk and communication I/O. Many FIM algorithms have been redesigned on Spark to achieve efficiency on large-scale datasets by exploiting its in-memory processing capabilities. However, apart from gaining memory efficiency, Spark-based FIM adaptations still face the same challenges. The limitations of the basic FIM algorithms, such as repeated scanning of the input data, massive candidate generation, and maintaining large conditional trees in memory, still need to be addressed, and tuning Spark's shuffle behavior becomes important for controlling communication overhead. In this paper, we propose a Spark-based algorithm, PartEclat, that combines the Eclat method with a partitioning technique to gain efficiency and scalability. The vertical data format avoids repeated scans of the input data when computing individual support values, while partitioning limits the movement of key-value pairs across cluster nodes (shuffle overhead) during iterations. Partitioning also helps manage the memory required to hold large transaction ID (TID) sets and the computational overhead of intersecting them. In-depth experiments on benchmark datasets were conducted to assess the efficiency and scalability of the algorithm. The results show that the proposed scheme outperforms other Spark-based Eclat variants, reducing network and computing loads by approximately 25-40%.
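To make the mechanism concrete, below is a minimal PySpark sketch of the general Eclat-plus-partitioning idea the abstract describes, not the paper's actual PartEclat implementation. It builds the vertical item-to-TID-set layout, derives supports by TID-set intersection, and hash-partitions prefix-keyed equivalence classes so that longer candidates sharing a prefix can be grown locally. The toy transactions, min_support, and the partition count are illustrative assumptions.

```python
from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="PartEclatSketch")
min_support = 2  # illustrative threshold, not from the paper

# Toy horizontal data: (tid, items). In the paper's setting this would
# be a large transactional dataset read from HDFS.
transactions = sc.parallelize([
    (1, ["a", "b", "c"]),
    (2, ["a", "c"]),
    (3, ["b", "c", "d"]),
    (4, ["a", "b", "c", "d"]),
])

# Vertical layout: one (item, TID set) pair per frequent item. Support
# is just the TID-set size, so later passes never rescan `transactions`.
vertical = (transactions
            .flatMap(lambda t: [(item, {t[0]}) for item in t[1]])
            .reduceByKey(lambda a, b: a | b)
            .filter(lambda kv: len(kv[1]) >= min_support))

# Frequent 2-itemsets via TID-set intersection (tiny enough here to
# enumerate on the driver; a real implementation keeps this distributed).
items = sorted(vertical.collect(), key=lambda kv: kv[0])
pairs = sc.parallelize(
    [((a, b), ta & tb) for (a, ta), (b, tb) in combinations(items, 2)]
).filter(lambda kv: len(kv[1]) >= min_support)

# Key each 2-itemset by its first item (its "prefix") and hash-partition,
# so every equivalence class lands on one node and deeper candidates are
# grown by local intersections instead of shuffling large TID sets.
classes = (pairs
           .map(lambda kv: (kv[0][0], (kv[0], kv[1])))
           .partitionBy(2)
           .groupByKey())

# Each class can now be mined independently, depth-first, in memory.
for prefix, members in classes.collect():
    for itemset, tids in members:
        print(itemset, "support =", len(tids))
```

The key design point is that `partitionBy` fixes a partitioner before `groupByKey`, so the grouping incurs no extra shuffle and all subsequent intersections within a prefix class stay node-local, which is the kind of shuffle-overhead reduction the abstract attributes to the partitioning technique.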
