Journal of Supercomputing

A Spark-based Apriori algorithm with reduced shuffle overhead


Abstract

Mining frequent itemsets is considered a core activity in finding association rules from transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. But the bottlenecks associated with Apriori, such as repeated scans of the input dataset and generation of all candidate itemsets prior to counting their support values, reduce its effectiveness for large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, must be scanned, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to in-memory processing capabilities. Spark makes iterative scanning of datasets faster by keeping them in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps a dataset in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle, operation incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark's parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method, and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation.
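To make the bottleneck concrete, the iterative structure the abstract describes — generate candidate k-itemsets from the frequent (k-1)-itemsets, then rescan the whole dataset to count support — can be sketched in plain single-machine Python. This is a minimal illustration of classic Apriori only; it does not reproduce SARSO's Spark-based distribution or partitioning, and all names here are illustrative.

```python
def apriori(transactions, min_support):
    """Minimal sketch of Apriori frequent-itemset mining.

    Each iteration k performs one full scan of the dataset to count
    candidate k-itemsets. On disk-backed frameworks like MapReduce,
    this per-iteration rescan is exactly the I/O bottleneck the
    abstract describes; Spark keeps the data in memory instead.
    """
    transactions = [frozenset(t) for t in transactions]

    # Iteration 1: count single items (first full scan).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Support counting: another full scan of the dataset per iteration.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

Because candidate generation depends on the previous iteration's output, the scans cannot be collapsed into one pass, which is why keeping the dataset resident in memory (as an RDD) pays off.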
Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO performs better in terms of running time and scalability.
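The general principle behind shuffle reduction can also be simulated without a cluster: aggregate counts locally within each partition so that only one (key, partial-count) record per key per partition would cross the network, rather than one record per occurrence. This toy single-process sketch illustrates that principle only; SARSO's actual partitioning method, which restricts key movement across nodes, is not reproduced here, and the function name is hypothetical.

```python
from collections import Counter

def count_with_local_aggregation(partitions, candidates):
    """Toy simulation of map-side aggregation before a shuffle.

    Counts candidate itemsets within each partition first, then merges
    the small per-partition Counters. `shuffled_records` tracks how many
    (itemset, count) records would cross the network: one per distinct
    key per partition, instead of one per individual occurrence.
    """
    total = Counter()
    shuffled_records = 0
    for part in partitions:
        local = Counter()                      # partition-local counts
        for t in part:
            for c in candidates:
                if c <= t:                     # candidate contained in transaction
                    local[c] += 1
        shuffled_records += len(local)         # one record per key leaves this partition
        total.update(local)                    # merge step (the "reduce" side)
    return total, shuffled_records
```

Without local aggregation, every matching (itemset, 1) pair would be shuffled individually; with it, the shuffled volume is bounded by (distinct keys × partitions), which is the kind of saving a partitioning scheme targets.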
