Conference: International Conference on Computer Science and Education

The Parallelization and Optimization of K-means Algorithm Based on Spark


Abstract

The K-means clustering algorithm has two known deficiencies: both the random selection of the initial cluster centers and the empirical choice of the K value affect the clustering results. To address them, a K-means clustering algorithm based on the canopy algorithm and the maximum-minimum distance is proposed. The K value is generated by the canopy algorithm, which avoids setting it manually. The set of candidate cluster centers is selected with a weighted density method to reduce the impact of outliers on the clustering results. The center points are then selected by the maximum-minimum distance to keep the clustering result from falling into a local optimum. The algorithm is parallelized on Spark. Experimental results on the UCI dataset show that the improved K-means algorithm not only improves clustering quality but also reduces the average number of iterations, and that it effectively improves the efficiency and parallel computing ability of the algorithm.
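The abstract gives no implementation details, so the following is only a minimal single-machine sketch of two of the ideas it names: estimating K with a canopy pass and picking initial centers by maximum-minimum distance. The function names, the single threshold `t2`, and the toy data are assumptions made for illustration. The weighted density step and the Spark parallelization are not shown.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_centers(points, t2):
    """Simplified canopy pass (assumed variant with one threshold):
    take the first remaining point as a canopy center, then drop every
    point within t2 of it. The number of centers is an estimate of K."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers

def max_min_seeds(points, k):
    """Max-min distance seeding: start from the first point, then
    repeatedly add the point whose distance to its nearest already
    chosen seed is largest."""
    seeds = [points[0]]
    while len(seeds) < k:
        next_seed = max(points, key=lambda p: min(dist(p, s) for s in seeds))
        seeds.append(next_seed)
    return seeds

# Hypothetical toy data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
k = len(canopy_centers(points, t2=2.0))
seeds = max_min_seeds(points, k)
```

On this toy data the canopy pass finds three canopies, so `k` is 3, and the seeds land one per group. The seeds would then be handed to an ordinary K-means loop as its initial centers. In a Spark version, the per-point distance computations are the natural part to distribute.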
