IEEE International Conference on Big Data

Parallel DBSCAN Algorithm Using a Data Partitioning Strategy with Spark Implementation



Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise data. However, existing MPI-based parallel implementations lack fault tolerance and offer no guarantee of balanced workloads. Although some Hadoop-based approaches have been proposed, they scale poorly because their merge process is inefficient. We propose a scalable parallel DBSCAN algorithm based on a partitioning strategy and implement it in Apache Spark. To reduce search time, our algorithm uses a kd-tree. To achieve better performance and scalability on top of the kd-tree, we adopt an effective partitioning technique that produces balanced sub-domains, each of which can be computed within a Spark executor. Moreover, we introduce a new merging technique: by mapping the relationships between local points and their bordering neighbors, all the partial clusters generated in the executors are merged to form the final complete clusters. Our experiments show that this merging approach is very effective in reducing the time taken by the merge phase and scales well as the number of processing cores and generated partial clusters increases. We implemented the algorithm in Java and evaluated its scalability using different numbers of processing cores on real and synthetic datasets containing up to several hundred million high-dimensional points. We used datasets at three scales to evaluate our implementation. At small scale, we use 50k, 100k, and 500k data points, obtaining up to a factor of 14.9 speedup on 16 cores. At medium scale, we use 1.0m, 1.5m, and 1.9m data points, obtaining a factor of 109.2 speedup on 128 cores. At large scale, we use 61.0m, 91.5m, and 115.9m data points, obtaining a factor of 8344.5 speedup on 16384 cores.
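The abstract describes the merge phase only at a high level: partial clusters produced independently in each executor are joined by mapping local points to their bordering neighbors. As a concrete illustration only, the following minimal Java sketch merges partial clusters with a union-find structure, assuming each executor emits pairs of partial-cluster IDs that share a bordering core point; the class, method, and variable names here are hypothetical and not taken from the paper.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch (not the paper's code): union-find over
    // partial-cluster IDs collected from per-partition DBSCAN runs.
    public class PartialClusterMerge {
        private final Map<Long, Long> parent = new HashMap<>();

        // Find the representative of x's set, with path compression.
        private long find(long x) {
            parent.putIfAbsent(x, x);
            long root = x;
            while (parent.get(root) != root) {
                root = parent.get(root);
            }
            while (parent.get(x) != root) {
                long next = parent.get(x);
                parent.put(x, root);
                x = next;
            }
            return root;
        }

        // Join the partial clusters containing a and b.
        public void union(long a, long b) {
            parent.put(find(a), find(b));
        }

        public boolean sameCluster(long a, long b) {
            return find(a) == find(b);
        }

        public static void main(String[] args) {
            PartialClusterMerge merge = new PartialClusterMerge();
            // Hypothetical input: each pair holds two partial-cluster IDs
            // (from different executors) found to share a bordering core
            // point during the local clustering phase.
            long[][] borderPairs = {{1, 4}, {4, 7}, {2, 5}};
            for (long[] pair : borderPairs) {
                merge.union(pair[0], pair[1]);
            }
            System.out.println(merge.sameCluster(1, 7)); // true: 1, 4, 7 form one cluster
            System.out.println(merge.sameCluster(1, 2)); // false: disjoint partial clusters
        }
    }

Union-find keeps this merge near-linear in the number of border pairs, which is consistent with the abstract's claim that the merge phase scales with the number of generated partial clusters; the paper's actual mapping structure may differ.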
