Journal: Information Systems

MR-SimLab: Scalable subgraph selection with label similarity for big data


Abstract

With the increasing size and complexity of available databases, existing machine learning and data mining algorithms face a scalability challenge. In many applications, the number of features describing the data can be extremely high, which hinders further exploration or even makes it infeasible. In fact, many of these features are redundant or simply irrelevant. Hence, feature selection plays a key role in overcoming information overload, especially in big data applications. Since many complex datasets can be modeled as graphs of interconnected labeled elements, this work focuses on feature selection for subgraph patterns. We propose MR-SimLab, a MapReduce-based approach for subgraph selection from large input subgraph sets. In many applications, it is easy to compute pairwise similarities between the labels of graph nodes. Our approach leverages this information to measure approximate subgraph matching by aggregating the elementary label similarities between matched nodes. Based on the aggregated similarity scores, it selects a small subset of informative, representative subgraphs. We provide a distributed implementation of our algorithm on top of the MapReduce framework that optimizes its computational efficiency for big data applications. We experimentally evaluate MR-SimLab on real datasets. The results show that our approach is scalable and that the selected subgraphs are informative. (C) 2017 Elsevier Ltd. All rights reserved.
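The abstract describes the method only at a high level. As a rough illustration of the idea, the single-process Python sketch below aggregates label similarities over matched nodes and keeps one representative per group of similar subgraphs. Every detail here is an assumption made for illustration and is not taken from the paper: two patterns are compared only when they share the same edge list, nodes are matched by index, the aggregate is a plain mean, selection is a greedy threshold test, and the hypothetical `map_phase` and `reduce_phase` functions merely imitate MapReduce stages in memory. The names `toy_label_sim` and `TOY_SIM` are also hypothetical.

```python
from collections import defaultdict

# A subgraph pattern is modelled as (labels, edges):
# labels is a tuple of node labels by index, edges a tuple of (i, j) index pairs.

# Hypothetical label similarity table; identical labels score 1.0.
TOY_SIM = {frozenset(("A", "B")): 0.8}


def toy_label_sim(x, y):
    """Toy pairwise label similarity in [0, 1] (an assumption for this sketch)."""
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset((x, y)), 0.0)


def pattern_similarity(a, b, label_sim):
    """Mean label similarity over index-matched nodes; 0.0 if topologies differ."""
    labels_a, edges_a = a
    labels_b, edges_b = b
    if not labels_a or len(labels_a) != len(labels_b) or edges_a != edges_b:
        return 0.0
    total = sum(label_sim(x, y) for x, y in zip(labels_a, labels_b))
    return total / len(labels_a)


def map_phase(subgraphs):
    """Map step: key each subgraph by its topology (node count and edge list)."""
    for sg in subgraphs:
        labels, edges = sg
        yield (len(labels), edges), sg


def reduce_phase(group, label_sim, threshold):
    """Reduce step: keep a subgraph only if no kept representative covers it."""
    representatives = []
    for sg in group:
        if all(pattern_similarity(sg, rep, label_sim) < threshold
               for rep in representatives):
            representatives.append(sg)
    return representatives


def select_subgraphs(subgraphs, label_sim, threshold):
    """Group by topology (shuffle), then select representatives per group."""
    groups = defaultdict(list)
    for key, sg in map_phase(subgraphs):
        groups[key].append(sg)
    selected = []
    for group in groups.values():
        selected.extend(reduce_phase(group, label_sim, threshold))
    return selected


if __name__ == "__main__":
    patterns = [
        (("A", "C"), ((0, 1),)),
        (("B", "C"), ((0, 1),)),   # close to the first pattern, since A ~ B
        (("D", "C"), ((0, 1),)),
        (("A", "B", "C"), ((0, 1), (1, 2))),
    ]
    for sg in select_subgraphs(patterns, toy_label_sim, threshold=0.85):
        print(sg)
```

In this toy run the second pattern is dropped because its mean label similarity to the first is at least the threshold, while the other three are kept. In a real MapReduce job the grouping by key would be done by the framework's shuffle stage rather than by an in-memory dictionary.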
