首页> 外文会议>IEEE international conference on data engineering >Large-scale frequent subgraph mining in MapReduce
【24h】

Large-scale frequent subgraph mining in MapReduce

机译:MapReduce中的大规模频繁子图挖掘

获取原文
获取外文期刊封面目录资料

摘要

Mining frequent subgraphs from a large collection of graph objects is an important problem in several application domains such as bio-informatics, social networks, computer vision, etc. The main challenge in subgraph mining is efficiency, as (i) testing for graph isomorphisms is computationally intensive, and (ii) the cardinality of the graph collection to be mined may be very large. We propose a two-step filter-and-refinement approach that is suitable to massive parallelization within the scalable MapReduce computing model. We partition the collection of graphs among worker nodes, and each worker applies the filter step to determine a set of candidate subgraphs that are locally frequent in its partition. The union of all such graphs is the input to the refinement step, where each candidate is checked against all partitions and only the globally frequent graphs are retained. We devise a statistical threshold mechanism that allows us to predict which subgraphs have a high chance to become globally frequent, and thus reduce the computational overhead in the refinement step. We also propose effective strategies to avoid redundant computation in each round when searching for candidate graphs, as well as a lightweight graph compression mechanism to reduce the communication cost between machines. Extensive experimental evaluation results on several real-world large graph datasets show that the proposed approach clearly outperforms the existing state-of-the-art and provides a practical solution to the problem of frequent subgraph mining for massive collections of graphs.
机译:从大量图对象中挖掘频繁的子图是几个应用领域(例如生物信息学,社交网络,计算机视觉等)中的重要问题。子图挖掘的主要挑战是效率,因为(i)测试图同构是计算密集型;(ii)要挖掘的图形集合的基数可能非常大。我们提出了一种两步式的过滤和细化方法,该方法适用于可伸缩MapReduce计算模型中的大规模并行化。我们在工作程序节点之间划分图的集合,每个工作程序都应用过滤步骤来确定一组在其分区中本地频繁出现的候选子图。所有这些图的并集是细化步骤的输入,在该步骤中,将针对所有分区检查每个候选项,并且仅保留全局频繁使用的图。我们设计了一种统计阈值机制,该机制使我们能够预测哪些子图很有可能变得全局频繁,从而减少了优化步骤中的计算开销。我们还提出了有效的策略,以避免在搜索候选图时在每一轮中进行多余的计算,并提出了一种轻量级的图压缩机制来减少机器之间的通信成本。在几个实际的大型图形数据集上进行的广泛实验评估结果表明,所提出的方法明显优于现有的最新技术,并为大量图形的频繁子图挖掘问题提供了实用的解决方案。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号