Parallel Hierarchical Subspace Clustering of Categorical Data

Pang Ning; Zhang Jifu; Zhang Chaowei; Qin Xiao

Abstract

Parallel clustering is an important research area of big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for handling large-scale categorical datasets due to two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We address these two problems with a MapReduce-based hierarchical subspace-clustering algorithm, called PAPU, that uses LSH-based data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, into which similar data objects are mapped. Exploiting parallel computing, PAPU obtains sub-clusters corresponding to their respective attribute subspaces from independent chunks in the local clustering phase. To improve the accuracy of the approximated clustering results, PAPU measures clusters of various scales by applying the hierarchical clustering scheme to iteratively merge sub-clusters during the global clustering phase. We implement PAPU on a 24-node Hadoop computing platform. The experimental results reveal that hierarchical subspace-clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior performance in terms of extensibility and scalability (e.g., a nearly linear speedup).
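The abstract does not spell out how the LSH-based partitioning works, so the following is only a minimal sketch of the general idea, not PAPU's actual scheme. It assumes MinHash over "attribute=value" tokens as the locality-sensitive hash, and it imitates the map and reduce steps with plain Python functions instead of Hadoop. All names here (tokens, minhash_signature, map_phase, reduce_phase, partition) are hypothetical.

```python
import zlib
from collections import defaultdict


def tokens(record):
    """Turn a categorical record into a set of 'index=value' tokens."""
    return {f"{i}={v}" for i, v in enumerate(record)}


def minhash_signature(record, num_hashes=2):
    """Assumed LSH: for each salted hash, keep the minimum token hash.

    Records that share many attribute values are likely to share a signature.
    """
    toks = tokens(record)
    return tuple(
        min(zlib.crc32(f"{salt}|{t}".encode("utf-8")) for t in toks)
        for salt in range(num_hashes)
    )


def map_phase(records, num_hashes=2):
    """Emit (bucket_key, record) pairs, as a mapper would."""
    for rec in records:
        yield minhash_signature(rec, num_hashes), rec


def reduce_phase(pairs):
    """Group records by bucket key, as the shuffle and reduce steps would."""
    buckets = defaultdict(list)
    for key, rec in pairs:
        buckets[key].append(rec)
    return dict(buckets)


def partition(records, num_hashes=2):
    """Split a dataset into independent sub-datasets keyed by LSH bucket."""
    return reduce_phase(map_phase(records, num_hashes))


if __name__ == "__main__":
    data = [
        ("red", "small", "round"),
        ("red", "small", "round"),
        ("blue", "large", "square"),
    ]
    for key, chunk in partition(data).items():
        print(key, chunk)
```

Each resulting bucket could then be clustered locally and independently, which is the property the local clustering phase relies on. Raising num_hashes makes buckets stricter (fewer, more similar records per bucket), while lowering it makes them coarser.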

Bibliographic record

Source: IEEE Transactions on Computers, 2019, Issue 4, pp. 542-555 (14 pages)
Authors: Pang Ning; Zhang Jifu; Zhang Chaowei; Qin Xiao
Author affiliations:

Taiyuan Univ Sci & Technol TYUST, Taiyuan 030024, Shanxi, Peoples R China;

Taiyuan Univ Sci & Technol TYUST, Taiyuan 030024, Shanxi, Peoples R China;

Auburn Univ, Samuel Ginn Coll Engn, Dept Comp Sci & Software Engn, Auburn, AL 36849 USA;

Auburn Univ, Samuel Ginn Coll Engn, Dept Comp Sci & Software Engn, Auburn, AL 36849 USA;

Original format: PDF
Language: English
Keywords: Hierarchical subspace-clustering; LSH-based data partitioning; categorical data; Hadoop

Similar literature

1. Parallel Hierarchical Subspace Clustering of Categorical Data [J]. Pang Ning, Zhang Jifu, Zhang Chaowei. IEEE Transactions on Computers. 2019, Issue 4
2. PUMA: Parallel subspace clustering of categorical data using multi-attribute weights [J]. Pang Ning, Zhang Jifu, Zhang Chaowei. Expert Systems with Applications. 2019, Issue JULa
3. A k-means type clustering algorithm for subspace clustering of mixed numeric and categorical datasets [J]. Amir Ahmad, Lipika Dey. Pattern Recognition Letters. 2011, Issue 7
4. A Subspace Hierarchical Clustering Algorithm for Categorical Data [C]. Joel Luís Carbonera, Mara Abel. IEEE International Conference on Tools with Artificial Intelligence. 2019
5. Automatic categorical data clustering and spatial data clustering by consecutive resolution refinement [D]. Foss, Andrew Philip Ogilvie. 2002
6. Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset [O]. Amir Ahmad. 2016
7. CLICKS: Mining Subspace Clusters in Categorical Data Via k-Partite Maximal Cliques [O]. Mohammad J. Zaki et al. 2008
