IEEE Transactions on Knowledge and Data Engineering

Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging



Abstract

Data clustering has attracted considerable research attention in computational statistics and data mining. In most related studies, the dissimilarity between two clusters is defined as the distance between their centroids or the distance between their two closest (or farthest) data points. However, all of these measures are vulnerable to outliers, and removing the outliers precisely is itself a difficult task. In view of this, we propose a new similarity measure, referred to as cohesion, to measure intercluster distances. Using this measure, we have designed a two-phase clustering algorithm, called cohesion-based self-merging (abbreviated as CSM), which runs in time linear in the size of the input data set. Combining the features of partitional and hierarchical clustering methods, algorithm CSM partitions the input data set into several small subclusters in the first phase and then repeatedly merges the subclusters based on cohesion in a hierarchical manner in the second phase. The time and space complexities of algorithm CSM are analyzed. As shown by our performance studies, cohesion-based clustering is very robust and possesses excellent tolerance to outliers in various workloads. More importantly, algorithm CSM is shown to cluster data sets of arbitrary shapes very efficiently and to provide better clustering results than prior methods.
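
The three conventional intercluster measures named in the abstract (centroid distance, closest-pair distance, farthest-pair distance) can be illustrated with a small sketch. This is not the CSM algorithm, and it does not implement cohesion, which the abstract does not define. The function names and the toy data are assumptions made for illustration only. The sketch simply shows how one stray point changes the conventional measures.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def centroid_distance(a, b):
    """Distance between the centroids of clusters a and b."""
    return dist(centroid(a), centroid(b))


def closest_pair_distance(a, b):
    """Distance between the two closest points, one from each cluster."""
    return min(dist(p, q) for p, q in product(a, b))


def farthest_pair_distance(a, b):
    """Distance between the two farthest points, one from each cluster."""
    return max(dist(p, q) for p, q in product(a, b))


def all_measures(a, b):
    return {
        "centroid": centroid_distance(a, b),
        "closest": closest_pair_distance(a, b),
        "farthest": farthest_pair_distance(a, b),
    }


# Two tight, well-separated toy clusters (made-up data, not from the paper).
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
cluster_b = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]

# The same cluster_a plus one stray point lying close to cluster_b.
cluster_a_with_outlier = cluster_a + [(9.0, 0.5)]

clean = all_measures(cluster_a, cluster_b)
noisy = all_measures(cluster_a_with_outlier, cluster_b)

for name in clean:
    print(f"{name:>8}: clean={clean[name]:.2f}  with outlier={noisy[name]:.2f}")
```

In this toy setup the single stray point pulls the closest-pair distance from 9.0 down to about 1.12 and shifts the centroid distance from 10.0 to 8.3, even though the bulk of both clusters has not moved. The farthest-pair distance is unaffected by this particular point, but a stray point on the far side of either cluster would inflate it in the same way.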
