Published in: Progress in Artificial Intelligence

Undersampled K-means approach for handling imbalanced distributed data



Abstract

K-means is a well-known partitional clustering technique, widely used for its low computational cost. However, its performance tends to degrade on skewed data distributions, i.e., imbalanced data. It often produces clusters of relatively uniform size even when the true clusters in the input data vary in size, a behavior called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it probably occurs more often in the K-means clustering process. As the minority class decreases in size, the “uniform effect” becomes more evident. To counter the “uniform effect”, we revisit the well-known K-means algorithm and provide a general method to properly cluster imbalanced distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from the majority class. We conduct experiments on twelve UCI datasets from various application domains, using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
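The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how "noisy and weak" instances are identified, so the sketch substitutes a hypothetical stand-in that drops the majority-class points farthest from the majority-class mean. The function names (`kmeans`, `undersample_majority`, `make_data`), the synthetic two-dimensional data, and all parameter values are assumptions made for illustration only.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means on (x, y) tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign


def undersample_majority(points, labels, majority_label, keep):
    """Hypothetical stand-in for the paper's undersampling step (an assumption):
    keep only the `keep` majority points closest to the majority-class mean."""
    maj = [p for p, l in zip(points, labels) if l == majority_label]
    rest = [(p, l) for p, l in zip(points, labels) if l != majority_label]
    cx = sum(p[0] for p in maj) / len(maj)
    cy = sum(p[1] for p in maj) / len(maj)
    maj.sort(key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    kept = [(p, majority_label) for p in maj[:keep]] + rest
    return [p for p, _ in kept], [l for _, l in kept]


def make_data(seed=0, n_major=1000, n_minor=50):
    """Synthetic imbalanced data: a wide majority blob and a small, tight minority blob."""
    rng = random.Random(seed)
    major = [(rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)) for _ in range(n_major)]
    minor = [(rng.gauss(6.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(n_minor)]
    return major + minor, [0] * n_major + [1] * n_minor


if __name__ == "__main__":
    points, labels = make_data()
    _, assign = kmeans(points, k=2)
    print("cluster sizes on raw data:",
          sorted(assign.count(j) for j in range(2)))

    pts_u, labels_u = undersample_majority(points, labels, majority_label=0, keep=50)
    _, assign_u = kmeans(pts_u, k=2)
    print("cluster sizes after undersampling:",
          sorted(assign_u.count(j) for j in range(2)))
```

Comparing the two printed size lists against the true class sizes (1000 and 50 before undersampling, 50 and 50 after) gives a rough feel for how far plain K-means drifts toward equal-sized clusters. The stand-in heuristic needs class labels to find the majority class, which is a simplification of this sketch and not a claim about the paper's method.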
