To further enhance the efficiency of K-means clustering algorithm for large-scale data, combined with MapReduce computational model, a parallel clustering method is proposed, which uses Hash function to extract samples and then obtains initial center by Pam algorithm.The sample extracted by Hash function can fully reflect the statistical characteristics of the data, using Pam algorithm to obtain the initial clustering center, and improve the traditional clustering algorithm to rely on the initial center of the problem.It uses the Pam algorithm to obtain the initial clustering center, and improves the problem of that the traditional clustering algorithms rely on the initial center.The experimental results show that the proposed algorithm can effectively improve the clustering quality and efficiency, and is suitable for the clustering analysis of large-scale data.%为进一步提高K-means算法对大规模数据聚类的效率,结合MapReduce计算模型,提出一种先利用Hash函数进行样本抽取,再利用Pam算法获取初始中心的并行聚类方法.通过Hash函数抽取的样本能充分反映数据的统计特性,使用Pam算法获取初始聚类中心,改善了传统聚类算法依赖初始中心的问题.实验结果表明该算法有效提高了聚类质量和执行效率,适用于对大规模数据的聚类分析.
展开▼