Journal: IEEE Transactions on Computers

Parallel Hierarchical Subspace Clustering of Categorical Data


Abstract

Parallel clustering is an important research area in big data analysis. Conventional Hierarchical Agglomerative Clustering (HAC) techniques are inadequate for large-scale categorical datasets because of two drawbacks. First, HAC consumes excessive CPU time and memory; second, it is non-trivial to decompose clustering tasks into independent sub-tasks that can be executed in parallel. We address both problems with PAPU, a MapReduce-based hierarchical subspace-clustering algorithm that uses LSH-based (locality-sensitive hashing) data partitioning. PAPU partitions a large-scale dataset into multiple independent sub-datasets, mapping similar data objects into the same sub-dataset. In the local clustering phase, PAPU processes these independent chunks in parallel and obtains sub-clusters corresponding to their respective attribute subspaces. To improve the accuracy of the approximate clustering results, PAPU applies a hierarchical clustering scheme in the global clustering phase, iteratively merging sub-clusters to capture clusters of various scales. We implement PAPU on a 24-node Hadoop computing platform. The experimental results show that hierarchical subspace clustering coupled with the data-partitioning strategy achieves high clustering efficiency on both synthetic and real-world large-scale datasets. The experiments also demonstrate that PAPU delivers superior extensibility and scalability (e.g., a nearly linear speedup).
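To make the partitioning idea concrete, here is a minimal single-machine sketch of LSH-style bucketing for categorical records. It is not PAPU's actual scheme, which the abstract does not specify: the use of MinHash, the function names `minhash_signature` and `lsh_partition`, and the `num_hashes` parameter are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def minhash_signature(record, num_hashes=4):
    """Return a MinHash signature for one categorical record.

    Assumption: each (attribute index, value) pair is one set element,
    so two records are similar when they agree on many attributes.
    """
    items = [f"{i}={v}" for i, v in enumerate(record)]
    signature = []
    for seed in range(num_hashes):
        # Salting with the seed emulates a family of hash functions.
        signature.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return tuple(signature)


def lsh_partition(records, num_hashes=4):
    """Group records whose signatures match (hypothetical helper).

    In a MapReduce setting the signature could serve as the map output
    key, so that each reducer receives one independent sub-dataset.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[minhash_signature(record, num_hashes)].append(record)
    return dict(partitions)


records = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
]
partitions = lsh_partition(records)
```

Identical records always share a bucket, while records with partial overlap collide with a probability that falls as `num_hashes` grows; a smaller value gives coarser partitions and a larger value gives finer ones.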
