With the explosive growth of data volumes, using clustering algorithms such as K-means to mine the latent value of big data has become an important research topic. Combining the Canopy algorithm with K-means addresses the problem of selecting the K initial center points. However, Canopy-Kmeans still chooses its initial center points at random and is sensitive to noise points. To address these problems, this paper proposes M-Canopy-Kmeans, a modified algorithm improved with density peaks, and parallelizes it on the Spark framework. Experimental results show that the improved algorithm avoids the blindness of Canopy center selection and effectively excludes noise points from the samples, markedly improving accuracy and noise immunity, and that it achieves a good speedup ratio and scalability on the Spark parallel framework.
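The abstract does not spell out the details of M-Canopy-Kmeans, so the following is only a minimal single-machine sketch of the general idea (density-based seeding with Canopy-style exclusion, followed by K-means). It is not the authors' algorithm and omits the Spark parallelization. The function names, the cutoff distance `dc`, the exclusion radius `t2`, and the `min_density` noise threshold are all assumptions introduced for illustration.

```python
import math

def density_peak_seeds(points, dc, t2, min_density=2):
    """Pick initial centers from dense points; return (seeds, core_points).

    Assumed scheme: rho = neighbour count within dc, delta = distance to the
    nearest denser point, candidates ranked by rho * delta.
    """
    n = len(points)
    # rho[i]: number of other points within the cutoff distance dc
    rho = [sum(1 for j in range(n)
               if j != i and math.dist(points[i], points[j]) <= dc)
           for i in range(n)]
    # delta[i]: distance to the nearest point with strictly higher density;
    # for points of maximal density, the largest distance to any point
    delta = []
    for i in range(n):
        higher = [math.dist(points[i], points[j])
                  for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(math.dist(points[i], p) for p in points))
    # Points below the density threshold are treated as noise (assumption)
    core = [i for i in range(n) if rho[i] >= min_density]
    ranked = sorted(core, key=lambda i: rho[i] * delta[i], reverse=True)
    seeds = []
    for i in ranked:
        # Canopy-style exclusion: skip candidates within t2 of a chosen seed
        if all(math.dist(points[i], s) > t2 for s in seeds):
            seeds.append(points[i])
    return seeds, [points[i] for i in core]

def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if empty)
        updated = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers

# Toy data: two tight groups plus one isolated noise point
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
seeds, core_points = density_peak_seeds(data, dc=1.5, t2=5.0)
final_centers = kmeans(core_points, seeds)
print(seeds, final_centers)
```

In this sketch the number of clusters K is not fixed in advance: it falls out of how many seeds survive the `t2` exclusion, which mirrors the role Canopy plays in Canopy-Kmeans. The pairwise distance computation is quadratic in the number of points, so a distributed implementation would need to partition or approximate that step.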