首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce
【24h】

Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

机译:在MapReduce中对大型数据集进行聚类的有效分布式密度峰

获取原文
获取原文并翻译 | 示例

摘要

Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, therefore incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we study efficient distributed algorithms for DP. We first show that a naïve MapReduce solution (Basic-DDP) has high communication and computation overhead. Then, we propose LSH-DDP, an approximate algorithm that exploits Locality Sensitive Hashing for partitioning data, performs local computation, and aggregates local results to approximate the final results. We address several challenges in employing LSH for DP. We leverage the characteristics of DP to deal with the fact that some of the result values cannot be directly approximated in local partitions. We present formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7–70x speedup over the naïve Basic-DDP and 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results. Compared to the popular K-means clustering, LSH-DDP also has comparable or better performance. Furthermore, LSH-DDP could achieve even higher efficiency with a lower accuracy requirement.
机译:密度峰值(DP)是最近提出的一种聚类算法,与现有的聚类算法相比具有明显的优势。它已经被广泛的应用。但是,DP要求计算每对输入点之间的距离,因此会产生二次计算开销,这对于大型数据集是不可行的。在本文中,我们研究了用于DP的高效分布式算法。我们首先表明,朴素的MapReduce解决方案(Basic-DDP)具有较高的通信和计算开销。然后,我们提出LSH-DDP,这是一种利用局部敏感哈希对数据进行分区,执行局部计算并汇总局部结果以近似最终结果的近似算法。我们解决了将LSH用于DP的几个挑战。我们利用DP的特性来处理某些结果值无法在本地分区中直接近似的事实。我们对LSH-DDP进行了形式化分析,结果表明,可以通过调整LSH-DDP的参数来控制近似质量和运行时间。在本地群集和EC2上的实验结果表明,LSH-DDP的速度是原始Basic-DDP的1.7到70倍,而最新的EDDPC方法是2倍,同时返回了可比的群集结果。与流行的K-means聚类相比,LSH-DDP也具有可比或更好的性能。此外,LSH-DDP可以以较低的精度要求实现更高的效率。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号