Journal: Journal of Parallel and Distributed Computing

A distributed approximate nearest neighbors algorithm for efficient large scale mean shift clustering

Abstract

Mean Shift clustering, a generalization of the well-known k-means clustering, computes arbitrarily shaped clusters, defined as the basins of attraction of the local modes formed by the density gradient ascent paths. Despite its potential for improved clustering accuracy, Mean Shift is a computationally expensive method for unsupervised learning. We introduce two contributions that provide approximate Mean Shift clustering, based on scalable procedures for computing the density gradient ascent and the cluster labeling, with linear time complexity, as opposed to the quadratic time complexity of exact clustering. Both propositions rely on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. When implemented on a serial system, these approximate methods can be used for moderate-sized datasets. To facilitate the analysis of Big Data, a distributed implementation written for the Spark/Scala ecosystem is proposed. An added benefit is that our proposed approximations of the density gradient ascent, when used as a pre-processing step in other clustering methods, can also improve the clustering accuracy of the latter. We present experimental results illustrating the effect of tuning parameters on cluster labeling accuracy and execution times, as well as the potential to solve concrete problems in Big Clustering. (C) 2019 Elsevier Inc. All rights reserved.
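
To make the idea in the abstract concrete, the following is a minimal single-machine sketch of an LSH-bucketed gradient ascent step. It is not the authors' implementation, which the abstract says targets Spark/Scala; it is written in plain Python for brevity. The hash family (random Gaussian projections with a bucket width), the use of the k nearest bucket members as the neighborhood, and all names and parameters (`lsh_key`, `approx_mean_shift_step`, `approx_mean_shift`, `k`, `width`, `n_projections`) are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def lsh_key(point, projections, offsets, width):
    """Hash a point to a tuple of floor((a . x + b) / width), one entry per projection."""
    return tuple(
        math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / width)
        for a, b in zip(projections, offsets)
    )


def approx_mean_shift_step(points, k, projections, offsets, width):
    """Move each point to the mean of its k nearest neighbors inside its LSH bucket."""
    # Group point indices by hash key; neighbor search never leaves a bucket.
    buckets = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[lsh_key(p, projections, offsets, width)].append(idx)

    shifted = [None] * len(points)
    for members in buckets.values():
        for i in members:
            # The point itself is a bucket member, so it is always among the neighbors.
            nearest = sorted(members, key=lambda j: math.dist(points[i], points[j]))[:k]
            dim = len(points[i])
            shifted[i] = tuple(
                sum(points[j][d] for j in nearest) / len(nearest) for d in range(dim)
            )
    return shifted


def approx_mean_shift(points, k=5, n_projections=2, width=4.0, n_iter=5, seed=0):
    """Run a fixed number of approximate gradient ascent steps (illustrative defaults)."""
    rng = random.Random(seed)
    dim = len(points[0])
    projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_projections)]
    offsets = [rng.uniform(0.0, width) for _ in range(n_projections)]
    current = [tuple(p) for p in points]
    for _ in range(n_iter):
        # Points are re-bucketed at every step because they move.
        current = approx_mean_shift_step(current, k, projections, offsets, width)
    return current


if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic Gaussian blobs, used only to exercise the sketch.
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    data += [(rng.gauss(20, 1), rng.gauss(20, 1)) for _ in range(50)]
    for p in approx_mean_shift(data)[:3]:
        print(p)
```

In this sketch the neighbor search inside a bucket is still quadratic in the bucket size, so the overall cost stays close to linear in the number of points only when buckets remain small relative to the dataset. A smaller `width` or more projections gives smaller buckets and faster steps, at the price of missing true neighbors that fall into adjacent buckets.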
