Journal: Journal of supercomputing

High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs


Abstract

Clustering of massive datasets is now a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. k-means, have to keep the data in memory and iterate over it many times, which is very costly for big datasets; (2) clustering algorithms that run in limited memory, especially the family of stream-clustering algorithms, have no parallel implementation that exploits modern multi-core processors, and their results lack decent quality. In this paper, we propose an algorithm that combines parallel clustering with single-pass stream-clustering algorithms. The aim is a clustering algorithm that uses the full capabilities of a regular multi-core PC to cluster the dataset as fast as possible while still producing clusters of acceptable quality. Our idea is to split the data into chunks and cluster each chunk in a separate thread. The clusters extracted from the chunks are then aggregated in a final stage by re-clustering. The parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. k-means++) and than stream-based algorithms. The quality of its solutions is often equal to that of k-means++, and it is significantly better than that of stream-clustering algorithms. Our solution also scales well with additional cores and hence provides an effective and fast way to cluster large datasets on multi-core and multi-processor systems.
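
The chunk-then-re-cluster idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`weighted_kmeans`, `chunked_kmeans`), the use of plain Lloyd iterations with random initialization, and the choice to weight each chunk-level center by the number of points it represents are all assumptions made for illustration. The abstract only states that chunks are clustered in separate threads and then aggregated by re-clustering. The sketch also keeps the data in an in-memory list, whereas the paper targets disk-resident datasets.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd iterations on weighted points (hypothetical helper).

    Returns (centers, mass), where mass[j] is the total weight of the
    points whose weighted mean is centers[j].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers (assumption)
    dim = len(points[0])
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest current center.
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each center to the weighted mean of its points;
        # a center that attracted no weight stays where it was.
        centers = [
            tuple(s / mass[j] for s in sums[j]) if mass[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers, mass


def chunked_kmeans(points, k, chunk_size, workers=4):
    """Cluster each chunk in its own thread, then re-cluster the centers."""
    chunks = [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]

    def cluster_chunk(indexed_chunk):
        idx, chunk = indexed_chunk
        return weighted_kmeans(chunk, [1.0] * len(chunk), min(k, len(chunk)), seed=idx)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(cluster_chunk, enumerate(chunks)))

    # Final stage: treat every chunk-level center as one weighted point.
    summary = [c for centers, _ in partial for c in centers]
    weights = [m for _, mass in partial for m in mass]
    final_centers, _ = weighted_kmeans(summary, weights, k)
    return final_centers
```

In CPython, threads running pure-Python code are serialized by the global interpreter lock, so this sketch shows only the structure of the approach and not the speedup reported in the abstract. A real implementation would need native code or processes to use all cores, and would read each chunk from disk instead of slicing a list.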
