Journal: Computer Engineering and Design (《计算机工程与设计》)

An Improved Maximal Frequent Itemset Mining Algorithm Based on Spark

Abstract

To mine frequent itemsets from large-scale, high-dimensional data, this work optimizes the time and space complexity and the parallelization strategy of traditional algorithms, and implements an improved maximal frequent itemset mining algorithm based on Spark. Combining Spark's distributed framework with the strengths of the DMFIA algorithm, it proposes two improvements: depth path search and length-first superset checking. A recursive depth path search generates the candidate maximal frequent itemsets in one pass; the candidates are then sorted by length and checked against their supersets. This reduces both the number of candidate itemsets and the number of mining passes, addressing the low efficiency of traditional maximal frequent itemset mining algorithms on large, high-dimensional data. Experimental results show that the algorithm runs 2 to 4 times faster than comparable algorithms and scales well with dataset size.
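The length-first superset check can be illustrated with a minimal single-machine sketch. This is not the paper's Spark implementation, which the abstract does not detail. The function name `keep_maximal` and the sample itemsets are hypothetical, and the candidates are assumed to be frequent already. Sorting longest-first means any proper superset of an itemset has been visited before the itemset itself, so each candidate only needs to be compared against the itemsets already kept.

```python
from typing import FrozenSet, Iterable, List


def keep_maximal(candidates: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Return the candidates that have no proper superset among the candidates.

    Hypothetical helper for illustration; assumes every candidate is frequent.
    """
    # Drop duplicate itemsets so that only proper supersets can eliminate one.
    unique = {frozenset(c) for c in candidates}
    # Length-first ordering: longest itemsets first, ties broken alphabetically
    # so the result order is deterministic.
    ordered = sorted(unique, key=lambda s: (-len(s), sorted(s)))

    maximal: List[FrozenSet[str]] = []
    for itemset in ordered:
        # A proper superset is strictly longer, so it was visited earlier and is
        # either kept or covered by a kept itemset; checking `maximal` suffices.
        if not any(itemset < kept for kept in maximal):
            maximal.append(itemset)
    return maximal


if __name__ == "__main__":
    sample = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["c", "d"], ["d"]]
    for itemset in keep_maximal(sample):
        print(sorted(itemset))
```

In this sample only `{a, b, c}` and `{c, d}` survive, because every other candidate is contained in one of them. In a distributed setting the same check would presumably run on candidates collected from the workers, but that placement is an assumption rather than something the abstract states.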
