Fast Scalable k-means++ Algorithm with MapReduce

Abstract

K-means++ is undoubtedly one of the most important initialization algorithms for k-means, owing to its provable approximation guarantee relative to the optimal solution. However, because it is sequential, k-means++ requires a large number of iterations to complete the initialization, and it becomes inefficient as the size of the data increases. Scalable k-means++ drastically reduces the number of iterations and can be easily applied to MapReduce systems, but it still requires two MapReduce jobs in each round, which incurs a large I/O cost and is time-consuming. In this paper, we propose the Oversampling and Refining (OnR) method, which improves the efficiency of scalable k-means++ by using only one MapReduce job to obtain Ω(k) centers in each round. In addition to the oversampling factor ℓ of scalable k-means++, OnR uses another oversampling factor o to further increase the number of chosen centers. Oversampling is executed in the Mapper phase; in the Reducer phase, a single Reducer removes the surplus centers introduced by o and outputs a set of centers that is the same as the output of scalable k-means++. To reduce the expensive network cost caused by an overly large o, OnR estimates the global cost from the local clustering cost and uses the estimate to remove some wrongly chosen points in the Mapper phase. Extensive experiments on real data show that OnR outperforms scalable k-means++ in terms of I/O cost and running time.
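
The oversample-then-refine idea in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's procedure. It assumes that each mapper estimates the global cost by scaling its local cost by the number of partitions, samples each point with probability o·ℓ·d²/(estimated cost), and that the reducer thins the candidates so each point's overall inclusion probability is close to the ℓ·d²/(global cost) used by scalable k-means++. The names `mapper`, `reducer`, and `one_round` are hypothetical, and the MapReduce phases are simulated with plain function calls.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mapper(partition, centers, ell, o, num_partitions, rng):
    """Oversample candidate centers from one data partition."""
    d2 = [min(sq_dist(p, c) for c in centers) for p in partition]
    local_cost = sum(d2)
    # Assumption: the global cost is estimated by scaling the local cost.
    est_global = local_cost * num_partitions
    candidates = []
    if est_global > 0:
        for p, d in zip(partition, d2):
            prob = min(1.0, o * ell * d / est_global)
            if rng.random() < prob:
                # Remember the probability used so the reducer can correct it.
                candidates.append((p, d, prob))
    return local_cost, candidates


def reducer(mapper_outputs, ell, rng):
    """Thin the oversampled candidates using the exact global cost."""
    global_cost = sum(local for local, _ in mapper_outputs)
    kept = []
    for _, candidates in mapper_outputs:
        for p, d, prob_used in candidates:
            target = min(1.0, ell * d / global_cost)
            # Accept with probability target / prob_used (capped at 1), so the
            # overall inclusion probability is about `target`.
            if rng.random() < min(1.0, target / prob_used):
                kept.append(p)
    return kept


def one_round(partitions, centers, ell, o, seed=0):
    """Simulate one round: all mappers, then a single reducer."""
    rng = random.Random(seed)
    outputs = [
        mapper(part, centers, ell, o, len(partitions), rng)
        for part in partitions
    ]
    return reducer(outputs, ell, rng)


if __name__ == "__main__":
    data = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
    print(one_round(data, centers=[(0.0, 0.0)], ell=2, o=3))
```

In this sketch the thinning step is exact only when the mapper's sampling probability is at least the target probability; a larger o makes that more likely but sends more candidates over the network, which is the trade-off the abstract describes.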

Bibliographic record

Source
International Conference on Algorithms and Architectures for Parallel Processing | 2014 | pp. 15-28 | 14 pages
Authors
Yujie Xu; Wenyu Qu; Zhiyang Li; Changqing Ji; Yuanyuan Li; Yinan Wu
Original format
PDF
