The Parallelization and Optimization of K-means Algorithm Based on Spark

Abstract

K-means clustering has two known weaknesses: the initial cluster centers are selected at random and the value of K is set empirically, and both choices affect the clustering result. To address them, a K-means clustering algorithm based on the canopy algorithm and the maximum and minimum distance is proposed. The canopy algorithm generates the value of K, so it does not have to be set manually. The set of candidate cluster centers is selected with a weighted density method to reduce the impact of outliers on the clustering result. The center points are then selected by the maximum and minimum distance to keep the clustering result from falling into a local optimum. The algorithm is parallelized on Spark. Experiments on UCI datasets show that the improved K-means algorithm both improves clustering quality and reduces the average number of iterations, and that it effectively improves the efficiency and parallel computing capability of the algorithm.
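
The two seeding ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the weighted density step and the Spark parallelization are omitted, the canopy pass uses only a single tight threshold (called `t2` here, an assumed parameter), the max-min selection simply starts from the first point, and the sample points are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """Single canopy pass: take a point as a center, drop every point
    within distance t2 of it, and repeat until no points remain.
    The number of centers returned serves as the estimate of K."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if euclidean(center, p) > t2]
    return centers


def max_min_seeds(points, k):
    """Max-min distance seeding: repeatedly add the point whose distance
    to its nearest already-chosen seed is largest."""
    seeds = [points[0]]  # assumption: start from the first point
    while len(seeds) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, s) for s in seeds))
        seeds.append(farthest)
    return seeds


# Hypothetical data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

k = len(canopy_centers(points, t2=3.0))  # K estimated instead of hand-set
seeds = max_min_seeds(points, k)         # spread-out initial centers
print(k, seeds)
```

The resulting seeds would then be passed to an ordinary K-means loop as initial centers. In a Spark setting, the per-point distance computations are the natural part to distribute, but how the paper partitions that work is not stated in the abstract.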

Bibliographic details

Source
International Conference on Computer Science and Education, 2020, pp. 457-462 (6 pages)
Authors
Zitian Wang; Aibo Xu; ZiPeng Zhang; Chunzhi Wang; Aijun Liu; Xiang Hu
Original format: PDF
Keywords
Clustering algorithms; Sparks; Machine learning algorithms; Distributed databases; Euclidean distance; Partitioning algorithms; Optimization
Date indexed: 2022-08-26 13:54:48

Similar literature

1. Applying an Improved Elephant Herding Optimization Algorithm with Spark-based Parallelization to Feature Selection for Intrusion Detection [J]. Hui Xu, Qianqian Cao, Heng Fu. International Journal of Performability Engineering, 2019, No. 6
2. A parallel k-means clustering algorithm based on redundance elimination and extreme points optimization employing MapReduce [J]. Zhuo Tang, Kunkun Liu, Jinbo Xiao. Concurrency and Computation, 2017, No. 20
3. Implementation of Hadoop optimization K-means parallel clustering algorithm [J]. Huang Suyu, Tan Lingli. Basic & Clinical Pharmacology & Toxicology, 2020, No. S9
4. Performance Analysis of Parallel K-Means with Optimization Algorithms for Clustering on Spark [C]. V. Santhi, Rini Jose. International Conference on Distributed Computing and Internet Technologies, 2018
5. Algorithms for VLSI circuit optimization and GPU-based parallelization [D]. Liu, Yifang. 2010
6. ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use [O]. Piotr Kraj, Ashok Sharma, Nikhil Garge. 2008
7. Applying an Improved Elephant Herding Optimization Algorithm with Spark-based Parallelization to Feature Selection for Intrusion Detection [O]. Hui Xu. 2019