A novel scalable DBSCAN algorithm with Spark

Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise data. Parallelizing DBSCAN is challenging, however: MPI- and OpenMP-based implementations lack fault tolerance and offer no guarantee of balanced workloads, and programming with MPI requires considerable expertise to manage communication between nodes. We present a new parallel DBSCAN algorithm built on the big data framework Spark. To reduce neighbor-search time, we apply a kd-tree in our algorithm. More specifically, we propose a novel approach that avoids communication between executors, so that partial clusters can be obtained locally and more efficiently. Building on the Java API, we select appropriate data structures carefully: a Queue holds the neighbors of each data point, and a Hashtable is used to check the status of data points as they are processed. In addition, we use other advanced Spark features to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the proposed algorithm scales up very well: on data sets containing up to 1 million high-dimensional points, it achieves speedups of up to 6 with 8 cores (10k points), 10 with 32 cores (100k points), and 137 with 512 cores (1M points). In a further experiment on 10k data points, a MapReduce version of the algorithm achieves speedups of 1.3 with 2 cores, 2.0 with 4 cores, and 3.2 with 8 cores.
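The abstract names two concrete data-structure choices: a Queue holding the neighbors still awaiting expansion, and a Hashtable tracking each point's status. Below is a minimal sketch of how those structures typically drive DBSCAN's cluster-expansion loop, assuming points are identified by row index; the method names, parameters, and the `rangeQuery` function (standing in for the paper's kd-tree eps-range search) are illustrative assumptions, not the authors' code.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.IntFunction;

// Hypothetical sketch of the Queue/Hashtable pattern the abstract describes:
// a queue holds neighbors awaiting expansion, and a hash table records each
// point's status (unassigned, NOISE, or a cluster id).
final class ExpandClusterSketch {
    static final int NOISE = -1;

    static void expandCluster(int seed, List<Integer> seedNeighbors, int clusterId,
                              int minPts, Map<Integer, Integer> status,
                              IntFunction<List<Integer>> rangeQuery) {
        status.put(seed, clusterId);
        Queue<Integer> frontier = new ArrayDeque<>(seedNeighbors);
        while (!frontier.isEmpty()) {
            int p = frontier.poll();
            Integer s = status.get(p);
            if (s != null && s == NOISE) {       // reachable noise becomes a border point
                status.put(p, clusterId);
                continue;
            }
            if (s != null) continue;             // already claimed by a cluster
            status.put(p, clusterId);
            List<Integer> neighbors = rangeQuery.apply(p); // kd-tree eps-range search
            if (neighbors.size() >= minPts) {    // p is a core point: keep expanding
                frontier.addAll(neighbors);
            }
        }
    }
}
```

For the single-threaded sketch above a plain HashMap would suffice; a java.util.Hashtable, as the abstract mentions, additionally synchronizes access, which matters when status checks happen concurrently.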
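The claim that executors never need to communicate suggests a partition-local pattern: each executor clusters only its own partition, and only the partial clusters travel back for merging. A hedged sketch of that pattern with Spark's Java API follows; the input path, the eps/minPts values, and the `localDbscan` placeholder are assumptions rather than the paper's actual code, and the merge of partial clusters across partition borders, which is the paper's contribution, is omitted.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of partition-local clustering: mapPartitions keeps all
// work inside each executor, so no shuffle occurs until the partial clusters
// are collected for merging.
public class LocalDbscanSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dbscan-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Assumed input: points serialized as double[] records.
            JavaRDD<double[]> points = sc.objectFile("hdfs:///path/to/points");

            // Each partition is clustered independently on its executor.
            JavaRDD<List<double[]>> partialClusters = points.mapPartitions(
                (Iterator<double[]> part) -> localDbscan(part, 0.5, 5).iterator());

            // Partial clusters come back to the driver; merging clusters that
            // span partition borders would happen here.
            List<List<double[]>> collected = partialClusters.collect();
        }
    }

    // Placeholder for the sequential step: in the paper this would be the
    // kd-tree-backed DBSCAN sketched above, run entirely within one partition.
    static List<List<double[]>> localDbscan(Iterator<double[]> part,
                                            double eps, int minPts) {
        return Collections.emptyList();
    }
}
```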
