High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs

Ali Hadian; Saeed Shahrivari

Abstract

Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g., k-means, has to keep the data in memory and iterate over the data many times, which is very costly for big datasets; (2) clustering algorithms that run on limited memory sizes, especially the family of stream-clustering algorithms, do not have a parallel implementation to utilize modern multi-core processors, and they also lack decent quality of results. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is to make a clustering algorithm that utilizes the maximum capabilities of a regular multi-core PC to cluster the dataset as fast as possible while producing clusters of acceptable quality. Our idea is to split the data into chunks and cluster each chunk in a separate thread. Then, the clusters extracted from the chunks are aggregated at the final stage using re-clustering. Parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g., k-means++) and stream-based algorithms. Also, the quality of the solution is often equal to that of k-means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with extra available cores and hence provides an effective and fast solution to clustering large datasets on multi-core and multi-processor systems.
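The chunk-then-re-cluster idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `chunked_kmeans`), the plain Lloyd iterations with random seeding, the weighting of chunk centers by cluster size, and the in-memory list standing in for a disk-resident file are all assumptions made for illustration. Note also that CPython threads do not speed up pure-Python arithmetic, so the sketch shows the structure of the approach rather than its performance.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, weights=None, iters=20, seed=0):
    """Weighted Lloyd iterations (hypothetical helper).

    Returns (centers, mass), where mass[j] is the total weight assigned
    to center j during the last assignment step.
    """
    rng = random.Random(seed)
    k = min(k, len(points))  # a short final chunk may hold fewer than k points
    weights = weights or [1.0] * len(points)
    centers = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            mass[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        # Move each center to the weighted mean of its points; keep empty ones.
        centers = [
            tuple(v / mass[j] for v in sums[j]) if mass[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers, mass


def chunks(points, size):
    """Yield consecutive slices of at most `size` points."""
    for i in range(0, len(points), size):
        yield points[i:i + size]


def chunked_kmeans(points, k, chunk_size, workers=4):
    """Cluster each chunk in its own thread, then re-cluster the chunk centers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: kmeans(c, k), chunks(points, chunk_size)))
    # Collect every non-empty chunk center together with its cluster size.
    pairs = [(c, w) for cs, ws in partial for c, w in zip(cs, ws) if w > 0]
    centers = [c for c, _ in pairs]
    weights = [w for _, w in pairs]
    # Final aggregation stage: weighted k-means over the chunk centers.
    final, _ = kmeans(centers, k, weights=weights)
    return final
```

In this sketch, `chunk_size` and `workers` play the role of the hardware-dependent parameters mentioned in the abstract: the chunk size bounds how much data each thread holds at once, and the worker count would normally match the number of available cores.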

Bibliographic details

Source
Journal of Supercomputing, 2014, No. 2, pp. 845-863 (19 pages)
Authors
Ali Hadian; Saeed Shahrivari
Affiliations

Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran;

Department of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Clustering; k-means; Parallel algorithms; Data mining; Big data
