IEEE International Conference on Big Data

Parallel DBSCAN Algorithm Using a Data Partitioning Strategy with Spark Implementation



Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise data. However, existing MPI-based parallel implementations lack fault tolerance and offer no guarantee of balanced workloads. Although some Hadoop-based approaches have been proposed, they scale poorly because their merge process is inefficient. We propose a scalable parallel DBSCAN algorithm based on a partitioning strategy and implement it in Apache Spark. To reduce search time, our algorithm uses a kd-tree. To achieve better performance and scalability on top of the kd-tree, we adopt an effective partitioning technique that produces balanced sub-domains, each of which can be computed within a Spark executor. Moreover, we introduce a new merging technique: by mapping the relationships between local points and their bordering neighbors, all the partial clusters generated in the executors are merged to form the final complete clusters. Our experiments show that this merging approach is very effective in reducing the time taken by the merge phase and scales well as the number of processing cores and generated partial clusters increases. We implemented the algorithm in Java and evaluated its scalability using different numbers of processing cores on real and synthetic datasets containing up to several hundred million high-dimensional points. We used datasets at three scales to evaluate our implementation. At small scale, we use 50k, 100k, and 500k data points, obtaining up to a factor of 14.9 speedup on 16 cores. At medium scale, we use 1.0m, 1.5m, and 1.9m data points, obtaining a factor of 109.2 speedup on 128 cores. At large scale, we use 61.0m, 91.5m, and 115.9m data points, obtaining a factor of 8344.5 speedup on 16384 cores.
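The abstract describes the merge phase only at a high level: partial clusters produced independently in each executor are joined by mapping local points to their bordering neighbors. As a concrete illustration only, the following minimal Java sketch merges partial clusters with a union-find structure, assuming each executor emits pairs of partial-cluster IDs that share a bordering core point; the class, method, and variable names here are hypothetical and not taken from the paper.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch (not the paper's code): union-find over
    // partial-cluster IDs collected from per-partition DBSCAN runs.
    public class PartialClusterMerge {
        private final Map<Long, Long> parent = new HashMap<>();

        // Find the representative of x's set, with path compression.
        private long find(long x) {
            parent.putIfAbsent(x, x);
            long root = x;
            while (parent.get(root) != root) {
                root = parent.get(root);
            }
            while (parent.get(x) != root) {
                long next = parent.get(x);
                parent.put(x, root);
                x = next;
            }
            return root;
        }

        // Join the partial clusters containing a and b.
        public void union(long a, long b) {
            parent.put(find(a), find(b));
        }

        public boolean sameCluster(long a, long b) {
            return find(a) == find(b);
        }

        public static void main(String[] args) {
            PartialClusterMerge merge = new PartialClusterMerge();
            // Hypothetical input: each pair holds two partial-cluster IDs
            // (from different executors) found to share a bordering core
            // point during the local clustering phase.
            long[][] borderPairs = {{1, 4}, {4, 7}, {2, 5}};
            for (long[] pair : borderPairs) {
                merge.union(pair[0], pair[1]);
            }
            System.out.println(merge.sameCluster(1, 7)); // true: 1, 4, 7 form one cluster
            System.out.println(merge.sameCluster(1, 2)); // false: disjoint partial clusters
        }
    }

Union-find keeps this merge near-linear in the number of border pairs, which is consistent with the abstract's claim that the merge phase scales with the number of generated partial clusters; the paper's actual mapping structure may differ.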
