Conference: IEEE International Conference on Big Data Computing Service and Applications

A Scalable Hierarchical Clustering Algorithm Using Spark


Abstract

Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because it exhibits inherent data dependencies during construction of the hierarchical tree. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm because it expresses iterative processes naturally. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that its scalability is sustained as the datasets scale up.
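
The link between single-linkage clustering and the Minimum Spanning Tree mentioned in the abstract can be illustrated with a small sequential sketch. This is not the paper's Spark implementation, which the abstract does not detail; it only assumes Euclidean distance and uses Kruskal's algorithm on the complete distance graph. The function names `single_linkage_merges` and `flat_clusters` are hypothetical, chosen for this illustration. Each edge that Kruskal accepts is both an MST edge and one merge step of the single-linkage dendrogram.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def _find(parent, x):
    """Return the root of x in the union-find forest, halving paths as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_linkage_merges(points):
    """Return the MST edges as (distance, i, j), in the order Kruskal accepts them.

    Read in order, the edges are the merge steps of the single-linkage
    dendrogram: each one joins the two closest remaining clusters.
    """
    n = len(points)
    parent = list(range(n))

    # All pairwise edges of the complete graph, sorted by distance.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for d, i, j in edges:
        ri, rj = _find(parent, i), _find(parent, j)
        if ri != rj:  # the edge connects two different clusters
            parent[ri] = rj
            merges.append((d, i, j))
            if len(merges) == n - 1:  # spanning tree is complete
                break
    return merges


def flat_clusters(n, merges, k):
    """Cut the dendrogram into k clusters by replaying the first n - k merges."""
    parent = list(range(n))
    for _, i, j in merges[: n - k]:
        parent[_find(parent, i)] = _find(parent, j)

    groups = {}
    for p in range(n):
        groups.setdefault(_find(parent, p), []).append(p)
    return sorted(groups.values())


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    tree = single_linkage_merges(pts)
    print(flat_clusters(len(pts), tree, 3))
```

The sketch enumerates all pairs on one machine, so it does not scale; it is meant only to show why an MST is sufficient to recover the full single-linkage hierarchy.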
