Conference: International conference on advances in computing, communications and informatics

Exploiting Apache Flink's iteration capabilities for distributed Apriori: Community detection problem as an example

Abstract

Extracting useful information from large datasets is one of the most important research problems, and association rule mining is one of the best methods for this purpose. Finding possible associations between items in large transaction-based datasets (that is, finding frequent patterns) is the most important part of association rule mining. Many algorithms exist for finding frequent patterns, but the Apriori algorithm remains a preferred choice because it is easy to implement and lends itself naturally to parallelization. Many single-machine Apriori variants exist, but the massive amount of data available today exceeds the capacity of a single machine. To meet the demands of this ever-growing data, a multi-machine Apriori algorithm is needed. For this type of distributed application, MapReduce is a popular fault-tolerant framework. Hadoop is one of the best open-source software frameworks that uses the MapReduce approach for distributed storage and distributed processing of huge datasets on clusters built from commodity hardware. However, the heavy disk I/O at each iteration of a highly iterative algorithm like Apriori makes Hadoop inefficient. A number of MapReduce-based platforms for parallel computing have been developed in recent years. Among them, two platforms, Spark and Flink, have attracted a lot of attention because of their built-in support for distributed computations. Earlier, we proposed a reduced-Apriori algorithm on the Spark platform that outperforms parallel Apriori, first because of the use of Spark and second because of the improvement we proposed to standard Apriori. The present work is a natural sequel to our earlier work: it aims to implement, test, and benchmark Apriori on Apache Flink and compares it with the Spark implementation. We conduct in-depth experiments to gain insight into the effectiveness, efficiency, and scalability of the Apriori algorithm on Flink. We also use the community detection graph mining problem as a test case to demonstrate our implementations.
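
For readers unfamiliar with why Apriori is described as highly iterative, the following is a minimal single-machine sketch of the standard level-wise algorithm in Python. It is not the authors' Flink or Spark implementation, and it does not include their reduced-Apriori improvement; the function name `apriori` and its parameters `transactions` and `min_support` are hypothetical names chosen for illustration. Each pass of the `while` loop rescans every transaction, which is the per-iteration cost that the abstract identifies as a bottleneck when intermediate results go through disk.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    Illustrative single-machine sketch; names and signature are assumptions,
    not taken from the paper.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        # and keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Counting pass: one full scan of the transactions per iteration.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result
```

In a distributed setting, the counting pass is the part that would typically be partitioned across machines, with the candidate set shared between workers at each iteration; how the paper actually maps this onto Flink's iteration operators is not stated in the abstract.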
