Parallelizing K-Means-Based Clustering on Spark


Abstract

K-means is in fact a family of clustering algorithms with different distance functions and a variety of extensions, e.g., fuzzy clustering and consensus clustering. Nevertheless, K-means-based clustering algorithms share a similar two-phase iterative procedure consisting of distance computation and centroid updating. Exploring parallel implementations of this two-phase procedure on Spark therefore applies to a wide range of clustering algorithms and also meets the practical needs raised by big data. This paper reveals implementation details for parallelizing K-means-based clustering on Spark. In particular, we first define the scope of so-called K-means-based clustering and then present the overall parallelizable framework on Spark. We discuss the technical barriers and the alternative strategies for each step. Experimental results on both large-scale UCI datasets and text datasets demonstrate the effectiveness and efficiency of our implementations.
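The two-phase procedure named in the abstract (distance computation, then centroid updating) can be illustrated with a minimal sketch. This is not the authors' implementation and it uses no Spark API: plain Python lists stand in for data partitions, and the hypothetical helpers `assign_partition` and `merge_stats` only mimic the per-partition map step and the reduce step that a Spark job might use. The arithmetic-mean update is an assumption that fits squared Euclidean distance; other members of the K-means family may need a different update.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Sequence[float]
Distance = Callable[[Point, Point], float]
Stats = Dict[int, Tuple[List[float], int]]  # cluster id -> (coordinate sums, count)


def squared_euclidean(x: Point, c: Point) -> float:
    """One possible distance function; others could be plugged in."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def assign_partition(partition: Sequence[Point], centroids: Sequence[Point],
                     dist: Distance) -> Stats:
    """Phase 1 (map side): find the nearest centroid for every point in one
    partition and accumulate per-cluster coordinate sums and counts."""
    stats: Stats = {}
    for x in partition:
        k = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
        sums, count = stats.get(k, ([0.0] * len(x), 0))
        stats[k] = ([s + v for s, v in zip(sums, x)], count + 1)
    return stats


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Reduce side: combine the partial sums and counts of two partitions."""
    merged = dict(a)
    for k, (sums, count) in b.items():
        if k in merged:
            old_sums, old_count = merged[k]
            merged[k] = ([s + t for s, t in zip(old_sums, sums)], old_count + count)
        else:
            merged[k] = (sums, count)
    return merged


def kmeans(partitions: Sequence[Sequence[Point]], centroids: Sequence[Point],
           dist: Distance = squared_euclidean, max_iter: int = 20) -> List[List[float]]:
    current = [list(c) for c in centroids]
    for _ in range(max_iter):
        total: Stats = {}
        for part in partitions:  # on a cluster, partitions would be processed in parallel
            total = merge_stats(total, assign_partition(part, current, dist))
        # Phase 2: update each centroid to the mean of its assigned points;
        # a centroid with no assigned points keeps its previous position.
        updated = [
            [s / total[k][1] for s in total[k][0]] if k in total else current[k]
            for k in range(len(current))
        ]
        if updated == current:  # stop once the centroids no longer move
            break
        current = updated
    return current


# Toy data: two partitions, two well-separated groups of points.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
result = kmeans(partitions, [(0.0, 0.0), (10.0, 0.0)])
print(result)
```

Only the small per-cluster sums and counts cross partition boundaries in this sketch, not the points themselves, which is the usual reason the two phases parallelize well.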

Bibliographic details

Source
International Conference on Advanced Cloud and Big Data | 2016 | xviii, 337 p. | 6 pages
Authors
Bowen Wang; Jun Yin; Qi Hua; Zhiang Wu; Jie Cao
Original format: PDF
Chinese Library Classification: Computer software
Keywords
pattern clustering; iterative methods; parallel programming

Similar literature

1. initKmix: A novel initial partition generation algorithm for clustering mixed data using k-means-based clustering [J]. Ahmad Amir, Khan Shehroz S. Expert Systems with Applications. 2021, Apr. issue.
2. Greedy Optimization for K-Means-Based Consensus Clustering [J]. Xue Li, Hongfu Liu. Tsinghua Science and Technology. 2018, No. 2.
3. Generalized k-means-based clustering for temporal data under weighted and kernel time warp [J]. Soheily-Khah Saeid, Douzal-Chouakria Ahlame, Gaussier Eric. Pattern Recognition Letters. 2016, May 1 issue.
4. Parallelizing K-Means-Based Clustering on Spark [C]. Bowen Wang, Jun Yin, Qi Hua. International Conference on Advanced Cloud and Big Data. 2016.
5. K-means-based Consensus Clustering: Algorithms, Theory and Applications [D]. Liu, Hongfu. 2018.
6. A Parallel Computing Approach to Spatial Neighboring Analysis of Large Amounts of Terrain Data Using Spark [O]. Jianbo Zhang, Zhuangzhuang Ye, Kai Zheng. 2021.
7. Research on the Parallelization of the DBSCAN Clustering Algorithm for Spatial Data Mining Based on the Spark Platform [O]. Fang Huang, Qiang Zhu, Ji Zhou. 2017.
