Journal: 《计算机技术与发展》 (Computer Technology and Development)

A Parallelized Improvement of the Apriori Algorithm Based on MapReduce


Abstract

The MapReduce-based parallel Apriori algorithm removes the need for the traditional Apriori algorithm to scan the database many times, but its candidate sets are still produced by a serial self-join of the frequent itemsets, which generates a large volume of intermediate candidate data. To make frequent-itemset mining with Apriori more efficient, this paper parallelizes the join step of the MapReduce-based Apriori algorithm and proposes CApriori, a new algorithm for mining frequent itemsets in big-data environments. CApriori derives the candidate (k+1)-itemsets from the frequent k-itemsets in parallel through Map and Reduce phases, so that the entire process of generating frequent itemsets runs in parallel. This reduces the number of candidate sets produced during iteration and saves storage space and running time. A comparison of time complexity shows that the improved algorithm substantially reduces the time spent in the join step when processing large-scale data. Experiments with CApriori on the Hadoop platform show that the improved algorithm is more efficient both on large datasets and at low support thresholds, and that it achieves very good speedup.
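The abstract does not spell out how the Map and Reduce phases produce the candidates, so the following is only a minimal sketch of one common way to parallelize the Apriori join step, not the authors' CApriori implementation. It assumes that the map phase keys each sorted frequent k-itemset by its first k-1 items and that each reduce call joins the itemsets sharing a key, then applies the standard Apriori subset pruning. The function names (`map_phase`, `shuffle`, `reduce_phase`, `generate_candidates`) are hypothetical, and the shuffle is simulated in memory rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(frequent_k):
    """Map: emit (prefix of k-1 items, last item) for each sorted frequent k-itemset."""
    for itemset in frequent_k:
        items = tuple(sorted(itemset))
        yield items[:-1], items[-1]


def shuffle(pairs):
    """Simulate the MapReduce shuffle: group values by key (in memory, assumption)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(prefix, last_items, frequent_set):
    """Reduce: join itemsets that share a prefix into (k+1)-item candidates."""
    for a, b in combinations(sorted(set(last_items)), 2):
        candidate = prefix + (a, b)
        k = len(candidate) - 1
        # Apriori pruning: keep the candidate only if every k-subset is frequent.
        if all(sub in frequent_set for sub in combinations(candidate, k)):
            yield candidate


def generate_candidates(frequent_k):
    """Derive candidate (k+1)-itemsets from frequent k-itemsets."""
    frequent_set = {tuple(sorted(s)) for s in frequent_k}
    groups = shuffle(map_phase(frequent_set))
    candidates = []
    # Each key could be handled by a separate reducer; here they run in a loop.
    for prefix, last_items in groups.items():
        candidates.extend(reduce_phase(prefix, last_items, frequent_set))
    return sorted(candidates)


if __name__ == "__main__":
    frequent_2 = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
    print(generate_candidates(frequent_2))
```

In this sketch each prefix group is independent of the others, which is what would let separate reducers work on the join step concurrently. Whether CApriori partitions the work this way is an assumption.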
