Journal: Expert Systems with Applications

PUMA: Parallel subspace clustering of categorical data using multi-attribute weights



Abstract

Traditional clustering schemes are ill-suited to high-dimensional categorical data for two main reasons. First, they usually represent each cluster using all dimensions without distinction. Second, they rely on a single projected dimension to weight an attribute, ignoring the relevance among attributes. We address both problems with a MapReduce-based subspace clustering algorithm, called PUMA, that uses multi-attribute weights. PUMA constructs attribute subspaces by calculating an attribute-value weight based on the co-occurrence probability of attribute values across different dimensions. It obtains the sub-clusters corresponding to the respective attribute subspaces from each computing node in parallel. Finally, PUMA produces clusters at various scales by applying hierarchical clustering to iteratively merge sub-clusters. We implement PUMA on a 24-node Hadoop cluster. Experimental results show that using multi-attribute weights with subspace clustering achieves better clustering accuracy on both synthetic and real-world high-dimensional datasets. They also show that PUMA performs well in terms of extensibility and scalability, with nearly linear speedup with respect to the number of nodes. Additionally, the results demonstrate that PUMA is reasonable, effective, and practical for expert-system applications such as knowledge acquisition, word sense disambiguation, automatic abstracting, and recommender systems. (C) 2019 Elsevier Ltd. All rights reserved.
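The abstract does not give the weight formula, so the following is only a rough sketch of what a co-occurrence-based attribute-value weight could look like, not PUMA's actual definition. It assumes a hypothetical weight: for a value v on attribute j, the average over every other attribute k of the largest conditional probability P(x_k = u | x_j = v). The function name `attribute_value_weights` and this scoring rule are illustrative assumptions; the MapReduce distribution and the hierarchical merging of sub-clusters are not shown.

```python
from collections import Counter, defaultdict


def attribute_value_weights(records):
    """Return {(attr_index, value): weight} for categorical records.

    ASSUMPTION (not taken from the paper): the weight of value v on
    attribute j is the mean, over all other attributes k, of
    max_u P(x_k = u | x_j = v). Values that co-occur consistently with
    values on other dimensions score close to 1.
    """
    n_attrs = len(records[0])
    if n_attrs < 2:
        raise ValueError("need at least two attributes to measure co-occurrence")

    value_counts = Counter()  # (j, v) -> number of records with x_j = v
    pair_counts = Counter()   # (j, v, k, u) -> records with x_j = v and x_k = u
    for row in records:
        for j, v in enumerate(row):
            value_counts[(j, v)] += 1
            for k, u in enumerate(row):
                if k != j:
                    pair_counts[(j, v, k, u)] += 1

    # For each (j, v) and each other attribute k, keep the largest pair count.
    best = defaultdict(dict)
    for (j, v, k, u), count in pair_counts.items():
        best[(j, v)][k] = max(best[(j, v)].get(k, 0), count)

    # Convert counts to conditional probabilities and average over k.
    return {
        (j, v): sum(best[(j, v)].values()) / (n * (n_attrs - 1))
        for (j, v), n in value_counts.items()
    }


if __name__ == "__main__":
    toy = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "x", "q")]
    for key, weight in sorted(attribute_value_weights(toy).items()):
        print(key, round(weight, 3))
```

In a MapReduce setting, the two counting passes would map naturally onto per-node partial counts that a reducer sums, but that split is likewise an assumption about how such a weight might be parallelized.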
