MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

Yaobin HE; Haoyu TAN; Wuman LUO; Shengzhong FENG; Jianping FAN

首页> 外文期刊>Frontiers of computer science in China >MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

【24h】

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

机译：MR-DBSCAN：基于MapReduce的可扩展DBSCAN算法，用于处理严重偏斜的数据

获取原文

获取原文并翻译 | 示例

掌桥外文数据库（机构版） >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.

机译：DBSCAN（带噪声的应用程序的基于密度的空间聚类）是一种重要的空间聚类技术，已被许多应用程序广泛采用。由于当今数据集的规模非常大，因此复杂数据分析（例如DBSCAN）的并行处理变得必不可少。但是，现有的并行DBSCAN算法存在三个主要缺点。首先，它们无法在并行任务之间适当平衡负载，尤其是在数据严重偏斜的情况下。其次，这些算法的可扩展性受到限制，因为并非所有关键子过程都可以并行化。第三，它们中的大多数并不是主要为没有共享的环境而设计的，这使得它们对于新兴的并行处理范例的可移植性较差。在本文中，我们介绍了MR-DBSCAN，这是一种使用MapReduce的可扩展DBSCAN算法。在我们的算法中，所有关键子过程都是完全并行的。这样，就不会有顺序处理引起的性能瓶颈。最重要的是，我们提出了一种基于计算成本估算的新型数据划分方法。目的是即使在数据严重偏斜的情况下也可以实现理想的负载平衡。此外，我们使用高达12亿点的真实大型数据集进行评估。实验结果很好地证实了MR-DBSCAN的效率和可扩展性。

著录项

来源
《Frontiers of computer science in China》 |2014年第1期|83-99|共17页
作者
Yaobin HE; Haoyu TAN; Wuman LUO; Shengzhong FENG; Jianping FAN;
展开▼
作者单位

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China,University of Chinese Academy of Sciences, Beijing 100049, China;

Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong 999077, China;

Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong 999077, China;

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类
关键词
data clustering; parallel algorithm; data mining; load balancing;

机译：数据聚类;并行算法数据挖掘;负载均衡;

相似文献

外文文献
中文文献
专利

1. Scaling up the DBSCAN algorithm for clustering large spatial databases based on sampling technique [J] . Guan Ji-hong, Zhou Shui-geng, Bian Fu-ling, Wuhan University Journal of Natural Sciences . 2001,第1a2期

机译：扩展基于采样技术的DBSCAN算法以对大型空间数据库进行聚类
2. Approaches for Scaling DBSCAN Algorithm to Large Spatial Databases [J] . ZHOU Aoying, ZHOU Shuigeng, CAO Jing Journal of Computer Science & Technology . 2000,第6期

机译：将DBSCAN算法扩展到大型空间数据库的方法
3. Approaches for Scaling DBSCAN Algorithm to Large Spatial Databases [J] . ZHOU Aoying, ZHOU Shuigeng, CAO Jing, Journal of Computer Science & Technology . 2000,第6期

机译：将DBSCAN算法扩展到大型空间数据库的方法
4. MapReduce-based Dragonfly Algorithm for large-scale Data-Clustering [C] . Ashish Kumar Tripathi, Pranav Saxena, Siddharth Gupta International Conference on Image Information Processing . 2019

机译：基于MapReduce的蜻蜓算法进行大规模数据聚类
5. Kernel-Based Specification Testing with Skewed and Heavy-Tailed Data [D] . Sheehan, Alice E. 2019

机译：基于内核的规范测试，带有偏斜和重尾数据
6. ToxDBScan: Large-Scale Similarity Screening of Toxicological Databases for Drug Candidates [O] . Michael Römer, Linus Backert, Johannes Eichner, 1997

机译：ToxDBScan：药物候选毒理学数据库的大规模相似性筛选
7. A New Scalable Parallel DBSCAN Algorithm Using the Disjoint-Set Data Structure [O] . Md. Mostofa, Ali Patwary, Diana Palsetia, 2013

机译：使用不交集数据结构的新的可伸缩并行DBSCAN算法

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

摘要

著录项

相似文献

相关主题

期刊订阅