
A Parallel DBSCAN Algorithm Based on Spark



Abstract

With the explosive growth of data, we have entered the era of big data. To sift through masses of information, many data mining algorithms are being implemented in parallel form. Cluster analysis occupies a pivotal position in data mining, and DBSCAN is one of the most widely used clustering algorithms. However, existing parallel DBSCAN algorithms usually divide the original database into several disjoint partitions, and as the data dimension increases, splitting and consolidating the high-dimensional space consumes a lot of time. To solve this problem, this paper proposes a parallel DBSCAN algorithm based on Spark (S_DBSCAN), which can quickly partition the original data and combine the clustering results. It consists of the following steps: 1) partitioning the raw data based on a random sample, 2) running local DBSCAN on the partitions in parallel, and 3) merging the data partitions based on the centroid. Experimental results show that, compared with the traditional DBSCAN algorithm, the proposed S_DBSCAN algorithm provides better operating efficiency and scalability.
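The three steps named in the abstract can be illustrated with a small single-machine sketch. The abstract does not say how the random sample drives partitioning or what the centroid merge criterion is, so everything here is an assumption made for illustration: points are assigned to the nearest of `k` randomly sampled pivots, local clusters are merged greedily when their centroids lie within a hypothetical `merge_dist` threshold, and plain Python lists stand in for Spark partitions. It is not the authors' S_DBSCAN implementation.

```python
import math
import random
from collections import defaultdict


def partition_by_sample(points, k, seed=0):
    """Step 1 (assumed scheme): sample k pivots, assign each point to its nearest pivot."""
    rng = random.Random(seed)
    pivots = rng.sample(points, k)
    parts = defaultdict(list)
    for p in points:
        nearest = min(range(k), key=lambda i: math.dist(p, pivots[i]))
        parts[nearest].append(p)
    return list(parts.values())


def local_dbscan(points, eps, min_pts):
    """Step 2: plain O(n^2) DBSCAN on one partition; noise points are dropped."""
    n = len(points)
    labels = [None] * n  # None = unvisited, -1 = noise, >= 0 = cluster id

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
        cluster_id += 1

    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        if label >= 0:
            clusters[label].append(p)
    return list(clusters.values())


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def merge_by_centroid(clusters, merge_dist):
    """Step 3 (assumed criterion): greedily merge clusters whose centroids are close."""
    merged = []
    for c in clusters:
        for m in merged:
            if math.dist(centroid(c), centroid(m)) <= merge_dist:
                m.extend(c)
                break
        else:
            merged.append(list(c))
    return merged


def s_dbscan_sketch(points, k, eps, min_pts, merge_dist, seed=0):
    """Run the three steps in sequence; in Spark, step 2 would run once per partition."""
    parts = partition_by_sample(points, k, seed)
    local = [c for part in parts for c in local_dbscan(part, eps, min_pts)]
    return merge_by_centroid(local, merge_dist)
```

On a real cluster the per-partition step would more plausibly be expressed with something like an RDD `mapPartitions` call, with only the cluster centroids collected for the merge. A known weakness of this simplified version is that a cluster split across partitions can lose points as noise when a fragment falls below `min_pts`.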
