Source: 《天津理工大学学报》 (Journal of Tianjin University of Technology)

Optimization of a MapReduce-Based Frequent Pattern Mining Algorithm

             

Abstract

Distributed data mining is a key technique in big data research. Existing distributed methods for frequent pattern mining still have many limitations on large data sets. For example, the parallel Apriori algorithm places a heavy burden on I/O through its repeated database scans and generates a large number of candidate sets. The FP-growth algorithm used in this paper consists of two stages: FP-tree construction and frequent pattern mining. The main idea is that, in the map stage before the FP-tree is built, FP-tree nodes are merged according to a step value and the item element encoding, and in the shuffle stage the items are assigned to different reducers by a balancing algorithm that evens out the workload. This reduces the randomness of data distribution and avoids the excessive overhead that some reducers incur during the mining stage when the data is partitioned unevenly. Experimental results show that, compared with existing methods, the improved algorithm achieves better computational efficiency and scalability on larger data sets.
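The abstract does not spell out the balancing algorithm, so the following is only a minimal sketch of the general idea of load-aware assignment of items to reducers. It swaps in a generic greedy heuristic (heaviest item first, always to the currently least-loaded reducer) and assumes that an item's support count is a reasonable proxy for its mining cost. The function names `count_items`, `balanced_partition`, and `reducer_loads` are hypothetical and do not come from the paper, and the step-value node merging is not modeled here.

```python
import heapq
from collections import Counter


def count_items(transactions):
    """Count how many transactions contain each item (its support)."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts


def balanced_partition(item_loads, num_reducers):
    """Assign items to reducers with a greedy heuristic.

    Assumption: this is a stand-in for the paper's balancing algorithm,
    not a reproduction of it. Items are visited from heaviest to lightest
    and each one goes to the reducer with the smallest load so far.
    """
    # Min-heap of (current_load, reducer_id).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Ties on load are broken by item name so the result is deterministic.
    ordered = sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, load in ordered:
        total, reducer = heapq.heappop(heap)
        assignment[item] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment


def reducer_loads(item_loads, assignment, num_reducers):
    """Sum the estimated load that ends up on each reducer."""
    loads = [0] * num_reducers
    for item, reducer in assignment.items():
        loads[reducer] += item_loads[item]
    return loads


# Toy data, invented purely for illustration.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "d"],
    ["b", "c"],
    ["a", "e"],
]
support = count_items(transactions)
assignment = balanced_partition(support, num_reducers=2)
loads = reducer_loads(support, assignment, num_reducers=2)
print(assignment)
print(loads)
```

In a real MapReduce job, a mapping like `assignment` would typically be consulted by a custom partitioner during the shuffle, in place of the default hash partitioning that spreads keys without regard to their cost.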
