首页> 外文会议>IEEE Advanced Information Technology, Electronic and Automation Control Conference >DFPS: Distributed FP-growth Algorithm Based on Spark

【24h】

DFPS: Distributed FP-growth Algorithm Based on Spark

机译：DFPS：基于火花的分布式FP-生长算法

获取原文

页面导航

摘要
著录项
相似文献
相关主题

摘要

Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rules mining. With the increment of data scale, many efficient single-machine algorithms of FIM, such as FP-growth and Apriori, cannot accomplish the computing tasks within reasonable time. As a result of the limitation of single-machine methods, researchers presented some distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. Nevertheless, the heavy disk I/O cost at each MapReduce operation makes PFP not efficient enough. YAFIM needs to generate candidate frequent itemsets in each iterative step. It makes YAFIM time-consuming. And if the scale of data is large enough, YAFIM algorithm will not work due to the limitation of memory since the candidate frequent itemsets need to be stored in the memory. And the size of candidate itemsets is very large especially facing the massive data. In this work, we propose a distributed FP-growth algorithm based on Spark, we call it DFPS. DFPS partitions computing tasks in such a way that each computing node builds the conditional FP-tree and adopts a pattern fragment growth method to mine the frequent itemsets independently. DFPS doesn't need to pass messages between nodes during mining frequent itemsets. Our performance study shows that DFPS algorithm is more excellent than YAFIM, especially when the length of transactions is long, the number of items is large and the data is massive. And DFPS has an excellent scalability. The experimental results show that DFPS is more than 10 times faster than YAFIM for T10I4D100K dataset and Pumsb_star dataset.

机译：频繁的项目集挖掘（FIM）是关联规则挖掘最重要和最耗时的步骤。随着数据量表的增量，许多有效的FIM单机算法，例如FP-Grower和Apriori，不能在合理的时间内完成计算任务。由于单机方法的限制，研究人员基于MapReduce和Spark的一些分布式算法，例如PFP和Yafim。尽管如此，每个MapReduce操作时的沉重磁盘I / O成本使PFP不足以效率。 Yafim需要在每个迭代步骤中生成候选频繁项目集。它使Yafim耗时。如果数据的规模足够大，由于候选频繁项目集需要存储在内存中，Yafim算法将无法使用。并且候选项目的大小非常大，特别面临大规模数据。在这项工作中，我们提出了一种基于火花的分布式FP-生长算法，我们称之为DFP。 DFPS分区以这样的方式进行计算任务，即每个计算节点构建条件FP-Tree并采用模式片段生长方法，独立地挖掘频繁项目集。在挖掘频繁项集期间，DFP不需要在节点之间传递消息。我们的绩效研究表明，DFPS算法比Yafim更优秀，尤其是当交易长度长时，物品数量大，数据是大量的。 DFP具有出色的可扩展性。实验结果表明，DFPS比T10I4D100K数据集和PUMSB_STAR数据集的Yafim快10倍以上。

著录项

来源
《IEEE Advanced Information Technology, Electronic and Automation Control Conference 》|2017年|1360-2045p|共7页
会议地点
作者
Xiujin Shi; Shaozong Chen; Hui Yang;
展开▼
作者单位

展开▼
会议组织
原文格式 PDF
正文语种
中图分类 TP2-53;
关键词
Frequent itemset mining; Association rules mining; FP-growth; Spark; RDD; Big Data; Distributed algorithm;

机译：频繁的项目集;关联规则挖掘;FP-生长;火花;RDD;大数据;分布式算法;

相似文献

外文文献
中文文献
专利

1. The Application of FP-Growth Algorithm Based on Distributed Intelligence in Wisdom Medical Treatment [J] . Xu Fangqin, Lu Haifeng International Journal of Pattern Recognition and Artificial Intelligence . 2017 ,第4期

机译：基于分布式智能的FP增长算法在智慧医疗中的应用
2. Privacy Preserving Association Rule Mining in Distributed Environments using Fp-Growth Algorithm and Elliptic Curve Cryptography [J] . T. Nusrat Jabeen, M. Chidambaram Indian Journal of Science and Technology . 2017 ,第48期

机译：使用Fp-Growth算法和椭圆曲线密码学的分布式环境中隐私保护关联规则挖掘
3. Screening hardware and volume factors in distributed machine learning algorithms on spark: A design of experiments (DoE) based approach [J] . Rodrigues Jairson B., Vasconcelos Germano C., Maciel Paulo R. M. Computing . 2021 ,第10期

机译：筛选在火花上分布式机器学习算法中的筛选硬件和体积因子：基于实验的设计（DOE）方法
4. DFPS: Distributed FP-growth Algorithm Based on Spark [C] . Xiujin Shi, Shaozong Chen, Hui Yang IEEE Advanced Information Technology, Electronic and Automation Control Conference . 2017

机译：DFPS：基于火花的分布式FP-生长算法
5. Index-based FP-growth: Distributed data mining. [D] . Khemani, Ravi Prakash. 2002

机译：基于索引的FP增长：分布式数据挖掘。
6. SAW Torque Sensor Gyroscopic Effect Compensation by Least Squares Support Vector Machine Algorithm Based on Chaos Estimation of Distributed Algorithm [O] . Wei Han, Xiongzhu Bu, Yihan Cao, 2019

机译：基于分布式算法混沌估计的最小二乘支持向量机算法补偿声表面波转矩传感器陀螺效应
7. A Ranking-Based Hashing Algorithm Based on the Distributed Spark Platform [O] . Anbang Yang, Jiangbo Qian, Huahui Chen, 2020

机译：一种基于分布式火花平台的排名散列算法

DFPS: Distributed FP-growth Algorithm Based on Spark

摘要

著录项

相似文献

相关主题

期刊订阅