A novel scalable DBSCAN algorithm with Spark

Abstract

DBSCAN is a well-known density-based clustering algorithm that can identify arbitrarily shaped clusters and eliminate noise data. Parallelizing DBSCAN is challenging, however: MPI- and OpenMP-based implementations lack fault tolerance and offer no guarantee of balanced workloads, and programming with MPI requires considerable expertise to manage communication between nodes. We present a new parallel DBSCAN algorithm built on the big data framework Spark. To reduce neighbor-search time, we apply a kd-tree in our algorithm. More specifically, we propose a novel approach that avoids communication between executors, so that partial clusters can be obtained locally and more efficiently. Building on the Java API, we select appropriate data structures carefully: a Queue holds the neighbors of each data point, and a Hashtable is used to check the status of data points as they are processed. In addition, we use other advanced Spark features to make our implementation more effective. We implement the algorithm in Java and evaluate its scalability using different numbers of processing cores. Our experiments demonstrate that the proposed algorithm scales up very well: on data sets containing up to 1 million high-dimensional points, it achieves speedups of up to 6 with 8 cores (10k points), 10 with 32 cores (100k points), and 137 with 512 cores (1M points). In a further experiment on 10k data points, a MapReduce version of the algorithm achieves speedups of 1.3 with 2 cores, 2.0 with 4 cores, and 3.2 with 8 cores.
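The abstract names two concrete data-structure choices: a Queue holding the neighbors still awaiting expansion, and a Hashtable tracking each point's status. Below is a minimal sketch of how those structures typically drive DBSCAN's cluster-expansion loop, assuming points are identified by row index; the method names, parameters, and the `rangeQuery` function (standing in for the paper's kd-tree eps-range search) are illustrative assumptions, not the authors' code.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.IntFunction;

// Hypothetical sketch of the Queue/Hashtable pattern the abstract describes:
// a queue holds neighbors awaiting expansion, and a hash table records each
// point's status (unassigned, NOISE, or a cluster id).
final class ExpandClusterSketch {
    static final int NOISE = -1;

    static void expandCluster(int seed, List<Integer> seedNeighbors, int clusterId,
                              int minPts, Map<Integer, Integer> status,
                              IntFunction<List<Integer>> rangeQuery) {
        status.put(seed, clusterId);
        Queue<Integer> frontier = new ArrayDeque<>(seedNeighbors);
        while (!frontier.isEmpty()) {
            int p = frontier.poll();
            Integer s = status.get(p);
            if (s != null && s == NOISE) {       // reachable noise becomes a border point
                status.put(p, clusterId);
                continue;
            }
            if (s != null) continue;             // already claimed by a cluster
            status.put(p, clusterId);
            List<Integer> neighbors = rangeQuery.apply(p); // kd-tree eps-range search
            if (neighbors.size() >= minPts) {    // p is a core point: keep expanding
                frontier.addAll(neighbors);
            }
        }
    }
}
```

For the single-threaded sketch above a plain HashMap would suffice; a java.util.Hashtable, as the abstract mentions, additionally synchronizes access, which matters when status checks happen concurrently.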
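The claim that executors never need to communicate suggests a partition-local pattern: each executor clusters only its own partition, and only the partial clusters travel back for merging. A hedged sketch of that pattern with Spark's Java API follows; the input path, the eps/minPts values, and the `localDbscan` placeholder are assumptions rather than the paper's actual code, and the merge of partial clusters across partition borders, which is the paper's contribution, is omitted.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of partition-local clustering: mapPartitions keeps all
// work inside each executor, so no shuffle occurs until the partial clusters
// are collected for merging.
public class LocalDbscanSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dbscan-sketch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Assumed input: points serialized as double[] records.
            JavaRDD<double[]> points = sc.objectFile("hdfs:///path/to/points");

            // Each partition is clustered independently on its executor.
            JavaRDD<List<double[]>> partialClusters = points.mapPartitions(
                (Iterator<double[]> part) -> localDbscan(part, 0.5, 5).iterator());

            // Partial clusters come back to the driver; merging clusters that
            // span partition borders would happen here.
            List<List<double[]>> collected = partialClusters.collect();
        }
    }

    // Placeholder for the sequential step: in the paper this would be the
    // kd-tree-backed DBSCAN sketched above, run entirely within one partition.
    static List<List<double[]>> localDbscan(Iterator<double[]> part,
                                            double eps, int minPts) {
        return Collections.emptyList();
    }
}
```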
