【24h】

Fast Scalable k-means++ Algorithm with MapReduce

机译:使用MapReduce的快速可扩展k-means ++算法

获取原文

摘要

K-means++ is undoubtedly one of the most important initializing algorithms for k-means owing to its provable approximation guarantee to the optimal solution. However, due to its sequential nature, k-means++ requires a large number of iterations to complete the initialization and it becomes inefficient as the size of data increase. Even though scalable k-means++ can drastically reduce the iterations and can be easily applied to the MapReduce systems, but due to its sequential nature, it still requires two MapReduce jobs in each round. Moreover, it takes a large number of I/O cost and it is time-consuming. In this paper, we propose Oversampling and Refining (OnR) method which can improve efficiency of scalable k-means++ by using only one MapReduce job to obtain Ω(k) centers in each round. Except for the oversampling factor £ of scalable k-means++, OnR uses another oversampling factor o to further increase the number of chosen centers. Oversampling is executed on the Mapper phase, and in Reducer phase, one Reducer is responsible for removing the oversampled centers generated from o and outputs a set of centers which is the same as the output of scalable k-means++. To reduce the expensive network cost caused by too large o, OnR estimates the global cost by the local clustering cost and uses it to remove some wrong points in Mapper phase. Extensive experiments on real data are conducted and the performance results indicate that OnR outperforms scalable k-means++ in the aspect of I/O cost and running time.
机译:K-means ++无疑是k-means的最重要的初始化算法之一,因为它可证明对最优解的近似保证。但是,由于k-means ++的顺序性质,它需要进行大量迭代才能完成初始化,并且随着数据量的增加,k-means ++变得效率低下。尽管可伸缩的k-means ++可以大大减少迭代次数,并且可以轻松地应用于MapReduce系统,但是由于其顺序性质,它在每一轮中仍需要两个MapReduce作业。而且,这花费了大量的I / O成本,并且是耗时的。在本文中,我们提出了过采样和细化(OnR)方法,该方法可以通过仅使用一个MapReduce作业来获得每轮Ω(k)个中心,从而提高可伸缩k-means ++的效率。除了可伸缩k-means ++的过采样因子£外,OnR使用另一个过采样因子o进一步增加了所选中心的数量。在Mapper阶段执行过采样,在Reducer阶段,一个Reducer负责删除从o生成的过采样中心,并输出与可伸缩k-means ++输出相同的一组中心。为了减少因o太大而造成的昂贵网络成本,OnR会根据本地群集成本来估算全局成本,并使用它来消除Mapper阶段中的一些错误点。在真实数据上进行了广泛的实验,性能结果表明OnR在I / O成本和运行时间方面优于可扩展的k-means ++。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号