Conference: International Conference on Advanced Cloud and Big Data

Parallelizing K-Means-Based Clustering on Spark

Abstract

K-means is in fact a family of clustering algorithms with different distance functions and a variety of extensions, e.g., fuzzy clustering and consensus clustering. Nevertheless, K-means-based clustering algorithms all employ a similar two-phase iterative procedure consisting of distance computation and centroid updating. Exploring parallel implementations of this two-phase procedure on Spark is therefore both applicable to a wide range of clustering algorithms and responsive to the practical needs raised by big data. This paper presents implementation details for parallelizing K-means-based clustering on Spark. In particular, we first delineate the boundary of so-called K-means-based clustering and then present the overall parallelizable framework on Spark. For each step, we discuss the technical barriers and the alternative strategies for addressing them. Experimental results on both large-scale UCI datasets and text datasets demonstrate the effectiveness and efficiency of our implementations.
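The two-phase procedure named in the abstract (distance computation, then centroid updating) can be sketched as follows. This is a minimal plain-Python stand-in, not the paper's implementation and not Spark code: lists of points play the role of data partitions, and all function names (`assign_partition`, `merge`, `kmeans`) are hypothetical. On Spark, one would presumably express the first phase as a per-partition map and the second as a reduction over per-cluster partial sums, but that mapping is an assumption here. The distance function is a parameter, reflecting the abstract's point that the family differs mainly in its distance.

```python
from functools import reduce


def squared_euclidean(x, c):
    """Default distance; any other distance with this signature can be swapped in."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def assign_partition(points, centroids, dist=squared_euclidean):
    """Phase 1 (distance computation) on one partition.

    Returns {cluster index: (coordinate sums, point count)} so that only
    small partial aggregates, not the points, need to be combined later.
    """
    partial = {}
    for x in points:
        k = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
        sums, count = partial.get(k, ([0.0] * len(x), 0))
        partial[k] = ([a + b for a, b in zip(sums, x)], count + 1)
    return partial


def merge(p, q):
    """Combine the partial aggregates of two partitions."""
    out = dict(p)
    for k, (sums, count) in q.items():
        if k in out:
            sums0, count0 = out[k]
            out[k] = ([a + b for a, b in zip(sums0, sums)], count0 + count)
        else:
            out[k] = (sums, count)
    return out


def kmeans(partitions, centroids, iterations=10, dist=squared_euclidean):
    """Repeat both phases for a fixed number of iterations (assumed stopping rule)."""
    for _ in range(iterations):
        partials = (assign_partition(p, centroids, dist) for p in partitions)
        totals = reduce(merge, partials, {})
        # Phase 2 (centroid updating): mean of assigned points;
        # a centroid with no assigned points is left unchanged.
        centroids = [
            [v / totals[k][1] for v in totals[k][0]] if k in totals else c
            for k, c in enumerate(centroids)
        ]
    return centroids


# Toy data split into two "partitions" (illustrative values only).
partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0)],
]
result = kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]], iterations=5)
print(result)  # → [[0.0, 1.0], [10.0, 11.0]]
```

Each partition emits only per-cluster sums and counts, so the amount of data combined in the update phase depends on the number of clusters and dimensions rather than on the number of points.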
