Journal: IEEE Transactions on Emerging Topics in Computing

A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis



Abstract

Clustering algorithms have emerged as a powerful alternative meta-learning tool for accurately analyzing the massive volumes of data generated by modern applications. Their main goal is to group data into clusters such that objects in the same cluster are similar according to specific metrics. There is a vast body of knowledge on clustering, and there have been attempts to analyze and categorize clustering algorithms for a large number of applications. However, a major issue in using clustering algorithms for big data, and one that causes confusion among practitioners, is the lack of consensus on the definition of their properties, together with the lack of a formal categorization. To alleviate these problems, this paper introduces concepts and algorithms related to clustering, gives a concise survey of existing clustering algorithms, and compares them from both a theoretical and an empirical perspective. From a theoretical perspective, we developed a categorizing framework based on the main properties pointed out in previous studies. Empirically, we conducted extensive experiments in which we compared the most representative algorithm from each category on a large number of real (big) data sets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, as well as stability, runtime, and scalability tests. In addition, we highlight the set of clustering algorithms that perform best for big data.
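To make the idea of internal and external validity metrics concrete, the following is a minimal sketch and not the paper's experimental setup. It assumes plain Lloyd's k-means as the algorithm under test, the within-cluster sum of squared errors (SSE) as an example of an internal metric, and the Rand index as an example of an external metric. The function names `kmeans`, `sse`, and `rand_index` and the toy data are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Assumes k <= len(points).

    Initial centroids are taken at evenly spaced positions in the input,
    which keeps this sketch deterministic (an illustrative choice).
    """
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Internal metric: within-cluster sum of squared distances (lower is tighter)."""
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def rand_index(labels_a, labels_b):
    """External metric: fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


# Toy data: two well-separated groups with known ground-truth labels.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
truth = [0, 0, 0, 1, 1, 1]

labels, centroids = kmeans(points, k=2)
print("SSE (internal):", sse(points, labels, centroids))
print("Rand index (external):", rand_index(labels, truth))
```

An internal metric such as SSE needs only the data and the clustering, whereas an external metric such as the Rand index needs reference labels, which is why a comparison like the one described above typically reports both kinds.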
