【24h】

DFPS: Distributed FP-growth Algorithm Based on Spark

机译:DFPS:基于火花的分布式FP-生长算法

获取原文

摘要

Frequent Itemset Mining (FIM) is the most important and time-consuming step of association rules mining. With the increment of data scale, many efficient single-machine algorithms of FIM, such as FP-growth and Apriori, cannot accomplish the computing tasks within reasonable time. As a result of the limitation of single-machine methods, researchers presented some distributed algorithms based on MapReduce and Spark, such as PFP and YAFIM. Nevertheless, the heavy disk I/O cost at each MapReduce operation makes PFP not efficient enough. YAFIM needs to generate candidate frequent itemsets in each iterative step. It makes YAFIM time-consuming. And if the scale of data is large enough, YAFIM algorithm will not work due to the limitation of memory since the candidate frequent itemsets need to be stored in the memory. And the size of candidate itemsets is very large especially facing the massive data. In this work, we propose a distributed FP-growth algorithm based on Spark, we call it DFPS. DFPS partitions computing tasks in such a way that each computing node builds the conditional FP-tree and adopts a pattern fragment growth method to mine the frequent itemsets independently. DFPS doesn't need to pass messages between nodes during mining frequent itemsets. Our performance study shows that DFPS algorithm is more excellent than YAFIM, especially when the length of transactions is long, the number of items is large and the data is massive. And DFPS has an excellent scalability. The experimental results show that DFPS is more than 10 times faster than YAFIM for T10I4D100K dataset and Pumsb_star dataset.
机译:频繁的项目集挖掘(FIM)是关联规则挖掘最重要和最耗时的步骤。随着数据量表的增量,许多有效的FIM单机算法,例如FP-Grower和Apriori,不能在合理的时间内完成计算任务。由于单机方法的限制,研究人员基于MapReduce和Spark的一些分布式算法,例如PFP和Yafim。尽管如此,每个MapReduce操作时的沉重磁盘I / O成本使PFP不足以效率。 Yafim需要在每个迭代步骤中生成候选频繁项目集。它使Yafim耗时。如果数据的规模足够大,由于候选频繁项目集需要存储在内存中,Yafim算法将无法使用。并且候选项目的大小非常大,特别面临大规模数据。在这项工作中,我们提出了一种基于火花的分布式FP-生长算法,我们称之为DFP。 DFPS分区以这样的方式进行计算任务,即每个计算节点构建条件FP-Tree并采用模式片段生长方法,独立地挖掘频繁项目集。在挖掘频繁项集期间,DFP不需要在节点之间传递消息。我们的绩效研究表明,DFPS算法比Yafim更优秀,尤其是当交易长度长时,物品数量大,数据是大量的。 DFP具有出色的可扩展性。实验结果表明,DFPS比T10I4D100K数据集和PUMSB_STAR数据集的Yafim快10倍以上。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号