Conference: International Conference on High Performance Computing & Simulation

A parallel algorithm for approximate frequent itemset mining using MapReduce


Abstract

Recently, several algorithms based on the MapReduce framework have been proposed for frequent pattern mining in Big Data. However, these solutions come with their own technical challenges, such as communication costs, in-process synchronization, balanced data distribution, and input parameter tuning, all of which negatively affect computation time. In this paper we present MrAdam, a novel parallel, distributed algorithm that addresses these problems. The key principle underlying the design of MrAdam is that one can make reasonable decisions in the absence of perfect answers. Given the classical minimum support threshold and a user-specified error bound, MrAdam exploits the Chernoff bound to mine “approximate” frequent itemsets with statistical error guarantees on their actual supports. These itemsets are generated in parallel and independently from subsets of the input dataset by exploiting the MapReduce parallel computation framework. The resulting collections of frequent itemsets from each subset are aggregated and filtered using a novel technique to produce a single output collection. MrAdam scales well to gigabytes of data and tens of machines, as shown experimentally on real datasets. The experiments also show that the proposed algorithm returns a good, statistically bounded approximation of the exact results.
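
The abstract does not spell out MrAdam's mining or aggregation steps, so the following is only a minimal Python sketch of the general idea it describes: mine subsets independently with a threshold relaxed by the error bound, then aggregate and filter. It is not the authors' method. The function names (`sample_size`, `map_phase`, `reduce_phase`), the brute-force enumeration, and the averaging rule in the reduce step are all assumptions made for illustration. The sample-size formula uses the additive (Hoeffding) form of the Chernoff bound, P(|s' - s| >= epsilon) <= 2 exp(-2 n epsilon^2), for a single itemset.

```python
import math
from collections import defaultdict
from itertools import combinations


def sample_size(epsilon, delta):
    """Transactions needed so one itemset's sampled support is within
    epsilon of its true support with probability >= 1 - delta, from
    2 * exp(-2 * n * epsilon**2) <= delta (additive Chernoff/Hoeffding)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def map_phase(subset, min_support, epsilon, max_len=2):
    """Hypothetical 'map' step: mine one subset on its own, keeping itemsets
    whose local support reaches the relaxed threshold min_support - epsilon.
    Brute-force enumeration, suitable only for tiny illustrative inputs."""
    counts = defaultdict(int)
    for transaction in subset:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    n = len(subset)
    return {i: c / n for i, c in counts.items() if c / n >= min_support - epsilon}


def reduce_phase(local_results, min_support, epsilon):
    """Hypothetical 'reduce' step: average local supports across subsets
    (an itemset missing from a subset counts as 0, a simplification that
    can underestimate) and filter with the same relaxed threshold."""
    totals = defaultdict(float)
    for result in local_results:
        for itemset, support in result.items():
            totals[itemset] += support
    m = len(local_results)
    return {i: t / m for i, t in totals.items() if t / m >= min_support - epsilon}


# Two equally sized toy subsets of a transaction dataset (made-up data).
subsets = [
    [["a", "b"], ["a", "b", "c"], ["a"], ["b", "c"]],
    [["a", "b"], ["a", "c"], ["a", "b"], ["c"]],
]
local = [map_phase(s, min_support=0.5, epsilon=0.1) for s in subsets]
approx = reduce_phase(local, min_support=0.5, epsilon=0.1)
n_needed = sample_size(epsilon=0.05, delta=0.05)
```

In this toy run the pair ("b", "c") clears the relaxed threshold in the first subset only, so it is dropped after aggregation, while ("a", "b") survives in both. A real implementation would replace the enumeration in `map_phase` with a proper miner such as Apriori or FP-Growth and would run each call as a separate MapReduce task.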