Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

Yanfeng Zhang; Shimin Chen; Ge Yu

首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

【24h】

Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

机译：在MapReduce中对大型数据集进行聚类的有效分布式密度峰

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

Density Peaks (DP) is a recently proposed clustering algorithm that has distinctive advantages over existing clustering algorithms. It has already been used in a wide range of applications. However, DP requires computing the distance between every pair of input points, therefore incurring quadratic computation overhead, which is prohibitive for large data sets. In this paper, we study efficient distributed algorithms for DP. We first show that a naïve MapReduce solution (Basic-DDP) has high communication and computation overhead. Then, we propose LSH-DDP, an approximate algorithm that exploits Locality Sensitive Hashing for partitioning data, performs local computation, and aggregates local results to approximate the final results. We address several challenges in employing LSH for DP. We leverage the characteristics of DP to deal with the fact that some of the result values cannot be directly approximated in local partitions. We present formal analysis of LSH-DDP, and show that the approximation quality and the runtime can be controlled by tuning the parameters of LSH-DDP. Experimental results on both a local cluster and EC2 show that LSH-DDP achieves a factor of 1.7–70x speedup over the naïve Basic-DDP and 2x speedup over the state-of-the-art EDDPC approach, while returning comparable cluster results. Compared to the popular K-means clustering, LSH-DDP also has comparable or better performance. Furthermore, LSH-DDP could achieve even higher efficiency with a lower accuracy requirement.

机译：密度峰值（DP）是最近提出的一种聚类算法，与现有的聚类算法相比具有明显的优势。它已经被广泛的应用。但是，DP要求计算每对输入点之间的距离，因此会产生二次计算开销，这对于大型数据集是不可行的。在本文中，我们研究了用于DP的高效分布式算法。我们首先表明，朴素的MapReduce解决方案（Basic-DDP）具有较高的通信和计算开销。然后，我们提出LSH-DDP，这是一种利用局部敏感哈希对数据进行分区，执行局部计算并汇总局部结果以近似最终结果的近似算法。我们解决了将LSH用于DP的几个挑战。我们利用DP的特性来处理某些结果值无法在本地分区中直接近似的事实。我们对LSH-DDP进行了形式化分析，结果表明，可以通过调整LSH-DDP的参数来控制近似质量和运行时间。在本地群集和EC2上的实验结果表明，LSH-DDP的速度是原始Basic-DDP的1.7到70倍，而最新的EDDPC方法是2倍，同时返回了可比的群集结果。与流行的K-means聚类相比，LSH-DDP也具有可比或更好的性能。此外，LSH-DDP可以以较低的精度要求实现更高的效率。

著录项

来源
《IEEE Transactions on Knowledge and Data Engineering》 |2016年第12期|3218-3230|共13页
作者
Yanfeng Zhang; Shimin Chen; Ge Yu;
展开▼
作者单位

School of Computer Science and Engineering, Northeastern University, Shenyang, China;

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;

School of Computer Science and Engineering, Northeastern University, Shenyang, China;

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类
关键词
Clustering algorithms; Approximation algorithms; Algorithm design and analysis; Partitioning algorithms; Machine learning algorithms; Density measurement;

机译：聚类算法;近似算法;算法设计与分析;分区算法;机器学习算法;密度测量;

相似文献

外文文献
中文文献
专利

1. DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce [J] . Younghoon Kim, Kyuseok Shim, Min-Soeng Kim, Information Systems . 2014,第juna期

机译：DBCURE-MR：使用MapReduce的高效的基于密度的大数据聚类算法
2. Towards a cost-efficient MapReduce: Mitigating power peaks for Hadoop clusters [J] . Zhu Nan, Liu Xue, Liu Jie, Tsinghua Science and Technology . 2014,第1期

机译：迈向具有成本效益的MapReduce：减轻Hadoop集群的功率峰值
3. Towards a Cost-Efficient MapReduce: Mitigating Power Peaks for Hadoop Clusters [J] . Nan Zhu, Xue Liu, Jie Liu, 清华大学学报（英文版） . 2014,第001期

机译：迈向具有成本效益的MapReduce：减轻Hadoop集群的功率峰值
4. Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce [C] . Yanfeng Zhang, Shimin Cheny, Ge Yu IEEE International Conference on Data Engineering . 2017

机译：在MapReduce中对大型数据集进行聚类的高效分布式密度峰
5. Parallel and distributed grouping algorithms for finding related records of huge data set on cluster grids. [D] . Bheemavaram, Roopa. 2006

机译：并行和分布式分组算法，用于在集群网格上查找海量数据集的相关记录。
6. A Robust Multi-Sensor Data Fusion Clustering Algorithm Based on Density Peaks [O] . Jiande Fan, Weixin Xie, Haocui Du 2020

机译：基于密度峰值的鲁棒多传感器数据融合聚类算法
7. D 3 -MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets [O] . He Haiwu, Simonet Anthony, Anjos Julio, 2015

机译：D 3 -MapReduce：面向分布式和动态数据集的MapReduce

Efficient Distributed Density Peaks for Clustering Large Data Sets in MapReduce

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅