
Fast K-Means Algorithm Clustering


Abstract

k-means has recently been recognized as one of the best algorithms for clustering unlabeled data. Because k-means relies mainly on computing distances between every data point and the cluster centers, its time cost becomes high when the dataset is large (for example, more than 500 million points). We propose a two-stage algorithm to reduce the time cost of distance calculation for huge datasets. The first stage is a fast distance calculation that uses only a small portion of the data to produce the best possible locations for the centers. The second stage is a slow distance calculation whose initial centers are taken from the first stage. The names fast and slow refer to the speed at which the centers move. In the slow stage, the whole dataset can be used to obtain the exact locations of the centers. The time cost of distance calculation in the fast stage is very low because the chosen training sample is small. The time cost in the slow stage is also reduced because only a small number of iterations is needed. Different initial locations of the cluster centers were used when testing the proposed algorithm. For large datasets, experiments show that the two-stage clustering method achieves a better speed-up (1-9 times).
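The two stages described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the small portion of data is chosen, so uniform random sampling is assumed here, and the names `lloyd`, `two_stage_kmeans`, `sample_fraction`, `max_iter`, and `tol` are hypothetical.

```python
import numpy as np

def lloyd(points, centers, max_iter=100, tol=1e-6):
    """Standard k-means (Lloyd) iterations. Returns (centers, iterations_run)."""
    centers = centers.copy()
    for it in range(1, max_iter + 1):
        # Squared Euclidean distance from every point to every center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # centers have stopped moving
            break
    return centers, it

def two_stage_kmeans(points, k, sample_fraction=0.1, seed=0):
    """Fast stage on a subsample, then slow stage on the full dataset."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(points)
    # Assumption: initial centers are k distinct points drawn at random.
    init = points[rng.choice(n, size=k, replace=False)]
    # Fast stage: iterate on a small random subset only (assumed uniform).
    m = min(n, max(k, int(sample_fraction * n)))
    sample = points[rng.choice(n, size=m, replace=False)]
    fast_centers, fast_iters = lloyd(sample, init)
    # Slow stage: refine on the whole dataset, starting from the fast result.
    centers, slow_iters = lloyd(points, fast_centers)
    return centers, fast_iters, slow_iters
```

Each fast-stage iteration touches only `sample_fraction` of the points, so it is cheap; the slow stage pays the full per-iteration cost but starts from centers that are already close to their final positions, which is why the abstract expects it to need few iterations. The pairwise distance array here is built by broadcasting and holds n × k × d values, so a dataset at the scale the abstract mentions would need to be processed in chunks.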
