Journal: IEEE Transactions on Computers

Parallel Hierarchical Subspace Clustering of Categorical Data


Abstract

Parallel clustering is an important research area in big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We solve these two problems with PAPU, a MapReduce-based hierarchical subspace-clustering algorithm that uses data partitioning based on locality-sensitive hashing (LSH). PAPU partitions a large-scale dataset into multiple independent sub-datasets, mapping similar data objects into the same sub-dataset. In the local clustering phase, PAPU processes these independent chunks in parallel and obtains sub-clusters corresponding to their respective attribute subspaces. To improve the accuracy of the approximate clustering results, PAPU applies a hierarchical clustering scheme during the global clustering phase, iteratively merging sub-clusters so that clusters of various scales are captured. We implement PAPU on a 24-node Hadoop computing platform. The experimental results show that hierarchical subspace clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior extensibility and scalability (e.g., a nearly linear speedup).
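
The LSH-based partitioning idea in the abstract can be illustrated with a small sketch. The abstract does not say which hash family or bucketing scheme PAPU uses, so everything below is an assumption made for illustration: MinHash over "attribute=value" tokens, the function names `minhash_signature` and `lsh_partition`, and the parameters `num_hashes` and `num_partitions`. It is not the authors' implementation, and it runs in a single process rather than on MapReduce.

```python
import hashlib
from collections import defaultdict

# Hypothetical seed used only to turn a signature into a partition key.
PARTITION_SEED = 10_000


def stable_hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(record, num_hashes=4):
    """MinHash signature of one categorical record.

    Each record is a tuple of categorical values; it is treated as the set of
    "attribute_index=value" tokens (an assumption, not taken from the paper).
    """
    tokens = [f"{i}={value}" for i, value in enumerate(record)]
    return tuple(
        min(stable_hash(token, seed) for token in tokens)
        for seed in range(num_hashes)
    )


def lsh_partition(records, num_hashes=4, num_partitions=3):
    """Assign every record to exactly one partition based on its signature.

    Records with identical signatures always share a partition, so similar
    records tend to be grouped and each partition can be clustered
    independently (the "local clustering phase" in the abstract).
    """
    partitions = defaultdict(list)
    for record in records:
        signature = minhash_signature(record, num_hashes)
        key = stable_hash(repr(signature), PARTITION_SEED) % num_partitions
        partitions[key].append(record)
    return dict(partitions)


if __name__ == "__main__":
    data = [
        ("red", "small", "round"),
        ("red", "small", "round"),
        ("blue", "large", "square"),
        ("red", "small", "oval"),
    ]
    for key, chunk in sorted(lsh_partition(data).items()):
        print(key, chunk)
```

In this sketch a smaller `num_hashes` makes it more likely that merely similar (not identical) records collide into the same partition, at the cost of coarser partitions. A global phase would then merge the per-partition sub-clusters hierarchically, which is not shown here.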
