A Scalable Hierarchical Clustering Algorithm Using Spark

Abstract

Clustering is often an essential first step in data mining, intended to reduce redundancy or to define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelizing such an algorithm is challenging because it exhibits inherent data dependency during construction of the hierarchical tree. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm because it expresses iterative processes naturally. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon's EC2 verifies that its scalability is sustained as the datasets scale up.
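
The link between single-linkage clustering and the Minimum Spanning Tree mentioned in the abstract can be illustrated with a small serial sketch. This is not the paper's Spark implementation: it assumes Euclidean distance, an in-memory list of points, and Kruskal's algorithm with a union-find structure. The function names `single_linkage_via_mst` and `cut_into_clusters` are hypothetical, chosen only for this illustration. The MST edges, taken in ascending order of weight, are exactly the merges that single-linkage clustering performs.

```python
import math
from itertools import combinations


def _find(parent, x):
    """Return the root of x's set, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def single_linkage_via_mst(points):
    """Return MST edges (i, j, distance) in ascending order of distance.

    Each edge joins two previously separate clusters, so the list is
    also the single-linkage merge sequence (the dendrogram, bottom up).
    """
    # Every pairwise edge of the complete distance graph, cheapest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    parent = list(range(len(points)))
    merges = []
    for dist, i, j in edges:
        root_i, root_j = _find(parent, i), _find(parent, j)
        if root_i != root_j:  # Kruskal: keep the edge only if it joins two trees
            parent[root_i] = root_j
            merges.append((i, j, dist))
    return merges


def cut_into_clusters(n_points, merges, k):
    """Replay the first n_points - k merges to obtain k flat clusters."""
    parent = list(range(n_points))
    for i, j, _ in merges[: n_points - k]:
        parent[_find(parent, i)] = _find(parent, j)
    return [_find(parent, i) for i in range(n_points)]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    merges = single_linkage_via_mst(pts)
    for i, j, dist in merges:
        print(f"merge clusters of points {i} and {j} at distance {dist:.3f}")
    print(cut_into_clusters(len(pts), merges, k=2))
```

The sketch builds all pairwise edges, which is quadratic in the number of points. That cost, together with the sequential dependency between merges, is the part a parallel formulation has to address.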

Bibliographic details

Source
IEEE International Conference on Big Data Computing Service and Applications | 2015 | 9 pages
Authors
Jin Chen; Liu Ruoqian; Chen Zhengzhang; Hendrix William; Agrawal Ankit; Choudhary Alok
Original format: PDF
Chinese Library Classification: TP212
Keywords
Hierarchical Clustering; Minimum Spanning Tree; Spark

Similar literature

1. Power Reduction in Very Large Scale Integration Circuit Using Spectral Clustering-Hierarchical Mean Cut Clustering Algorithm (SC-HMCC) [J]. T. Kowsalya, S. Palaniswami. Journal of Computational and Theoretical Nanoscience. 2016, No. 3
2. Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm, Minimum Spanning Tree, and Hierarchical Clustering in an Applied Study [J]. Saeedeh Pourahmad, Atefeh Basirat, Amir Rahimi. Computational and Mathematical Methods in Medicine. 2020, No. 1
3. Scalable Parallel Big Data Summarization Technique Based on Hierarchical Clustering Algorithm [J]. Veronica S. Moertini, Matthew Ariel. Journal of Theoretical and Applied Information Technology. 2020, No. 21
4. A Scalable Hierarchical Clustering Algorithm Using Spark [C]. Jin Chen, Liu Ruoqian, Chen Zhengzhang. IEEE International Conference on Big Data Computing Service and Applications. 2015
5. Design of a Scalable, Configurable, and Cluster-based Hierarchical Hardware Accelerator for a Cortically Inspired Algorithm and Recurrent Neural Networks [D]. Dey, Sumon. 2019
6. Does Determination of Initial Cluster Centroids Improve the Performance of K-Means Clustering Algorithm? Comparison of Three Hybrid Methods by Genetic Algorithm Minimum Spanning Tree and Hierarchical Clustering in an Applied Study [O]. Saeedeh Pourahmad, Atefeh Basirat, Amir Rahimi. 2020
7. Spark-IDPP: high-throughput and scalable prediction of intrinsically disordered protein regions with Spark clusters on the Cloud [O]. Bożena Małysiak-Mrozek, Tomasz Baron, Dariusz Mrozek. 2018
