首页> 外文期刊>Frontiers of computer science in China >MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data
【24h】

MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data

机译:MR-DBSCAN:基于MapReduce的可扩展DBSCAN算法,用于处理严重偏斜的数据

获取原文
获取原文并翻译 | 示例
           

摘要

DBSCAN (density-based spatial clustering of applications with noise) is an important spatial clustering technique that is widely adopted in numerous applications. As the size of datasets is extremely large nowadays, parallel processing of complex data analysis such as DBSCAN becomes indispensable. However, there are three major drawbacks in the existing parallel DBSCAN algorithms. First, they fail to properly balance the load among parallel tasks, especially when data are heavily skewed. Second, the scalability of these algorithms is limited because not all the critical sub-procedures are parallelized. Third, most of them are not primarily designed for shared-nothing environments, which makes them less portable to emerging parallel processing paradigms. In this paper, we present MR-DBSCAN, a scalable DBSCAN algorithm using MapReduce. In our algorithm, all the critical sub-procedures are fully parallelized. As such, there is no performance bottleneck caused by sequential processing. Most importantly, we propose a novel data partitioning method based on computation cost estimation. The objective is to achieve desirable load balancing even in the context of heavily skewed data. Besides, We conduct our evaluation using real large datasets with up to 1.2 billion points. The experiment results well confirm the efficiency and scalability of MR-DBSCAN.
机译:DBSCAN(带噪声的应用程序的基于密度的空间聚类)是一种重要的空间聚类技术,已被许多应用程序广泛采用。由于当今数据集的规模非常大,因此复杂数据分析(例如DBSCAN)的并行处理变得必不可少。但是,现有的并行DBSCAN算法存在三个主要缺点。首先,它们无法在并行任务之间适当平衡负载,尤其是在数据严重偏斜的情况下。其次,这些算法的可扩展性受到限制,因为并非所有关键子过程都可以并行化。第三,它们中的大多数并不是主要为没有共享的环境而设计的,这使得它们对于新兴的并行处理范例的可移植性较差。在本文中,我们介绍了MR-DBSCAN,这是一种使用MapReduce的可扩展DBSCAN算法。在我们的算法中,所有关键子过程都是完全并行的。这样,就不会有顺序处理引起的性能瓶颈。最重要的是,我们提出了一种基于计算成本估算的新型数据划分方法。目的是即使在数据严重偏斜的情况下也可以实现理想的负载平衡。此外,我们使用高达12亿点的真实大型数据集进行评估。实验结果很好地证实了MR-DBSCAN的效率和可扩展性。

著录项

  • 来源
    《Frontiers of computer science in China》 |2014年第1期|83-99|共17页
  • 作者单位

    Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China,University of Chinese Academy of Sciences, Beijing 100049, China;

    Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong 999077, China;

    Department of Computer Science, Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong University of Science and Technology, Hong Kong 999077, China;

    Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;

    Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;

  • 收录信息
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

    data clustering; parallel algorithm; data mining; load balancing;

    机译:数据聚类;并行算法数据挖掘;负载均衡;

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号