A Parallel Clustering Algorithm Implementation Based on Apache Mahout

Abstract

K-means clustering is one of the best-known clustering algorithms and is widely used in practical applications. It divides a set of n data points in d-dimensional space into k clusters such that, under a chosen criterion, data points in the same cluster are closer to each other than to points in other clusters. Traditional k-means clustering proceeds by alternately executing two steps: an assignment step and an update step. The assignment step assigns each data point to its nearest cluster, with distance commonly measured by the Euclidean distance. The update step recalculates the center of each cluster. For large-scale datasets, k-means clustering spends most of its execution time calculating distances between each data point and the existing cluster centers. The distance computation for one data point is independent of the computations for the others, so these calculations can be carried out concurrently. This paper proposes a simple and efficient implementation of a parallel k-means clustering algorithm, built on the existing Mahout API, to speed up clustering of large-scale datasets. The implementation is also packaged as an easy-to-use API that developers can use without any additional configuration. Experimental results show a significant improvement in clustering speed on large-scale datasets, demonstrating the effectiveness and efficiency of the proposed implementation.
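
The two steps described in the abstract, and the independence of the per-point distance computations, can be illustrated with a minimal sketch. This is not the authors' Mahout-based implementation and it does not use the Mahout API; it is a plain-Python illustration under stated assumptions. The names `nearest_center`, `assign_chunk`, `update_centers`, and `parallel_kmeans`, the `workers` and `max_iter` parameters, and the choice to keep an empty cluster's center unchanged are all hypothetical. A thread pool is used only to show how the assignment step splits into independent chunks; in CPython, threads running pure-Python arithmetic are unlikely to yield a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist  # Euclidean distance, Python 3.8+


def nearest_center(point, centers):
    """Assignment step for one point: index of the closest center."""
    return min(range(len(centers)), key=lambda j: dist(point, centers[j]))


def assign_chunk(chunk, centers):
    """Assign every point in a chunk; chunks do not depend on each other."""
    return [nearest_center(p, centers) for p in chunk]


def update_centers(points, labels, centers):
    """Update step: each center becomes the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def parallel_kmeans(points, centers, workers=4, max_iter=100):
    """Alternate assignment and update until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            # Each chunk is assigned concurrently against the same centers.
            parts = pool.map(assign_chunk, chunks, [centers] * len(chunks))
            labels = [label for part in parts for label in part]
            new_centers = update_centers(points, labels, centers)
            if new_centers == centers:
                break
            centers = new_centers
    return centers, labels
```

Only the assignment step is parallelized here, since that is where the abstract says most of the execution time goes; the update step runs serially after all chunk results have been collected, so every iteration sees a consistent set of centers.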

Bibliographic details

Source
International Conference on Instrumentation and Measurement, Computer, Communication and Control, 2016, pp. 790-795 (6 pages)
Authors
Xia Daoping; Zhong Alin; Long Yubo
Original format: PDF
Keywords
Clustering algorithms; Algorithm design and analysis; Partitioning algorithms; Classification algorithms; Parallel algorithms; Euclidean distance; Software algorithms
