Clustering Categorical Data Based on Distance Vectors

Peng Zhang; Xiaogang Wang; Peter X.-K. Song


Abstract

We introduce a novel statistical procedure for clustering categorical data based on Hamming distance (HD) vectors. The proposed method is conceptually simple and computationally straightforward, because it does not require any specific statistical models or any convergence criteria. Moreover, unlike most currently existing algorithms that compute the class membership or membership probability for every data point at each iteration, our algorithm sequentially extracts clusters from the given dataset. That is, at each iteration our algorithm strives to identify only one cluster, which will then be deleted from the dataset at the next iteration; this procedure repeats until there are no more significant clusters in the remaining data. Consequently, the number of clusters can be determined automatically by the algorithm. As for the identification and extraction of a cluster, we first locate the cluster center by using a Pearson chi-squared-type statistic on the basis of HD vectors. The partition of the dataset produced by our algorithm is unique and insensitive to the input order of data points. The performance of the proposed algorithm is examined using both simulated and real world datasets. Comparisons with two well-known clustering algorithms, K-modes and AutoClass, show that the proposed algorithm substantially outperforms these competitors, with the classification rate or the information gain typically improved by several orders of magnitude. Computational complexity and run time comparisons are also provided.
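
The abstract names two ingredients, Hamming distance (HD) vectors and a Pearson chi-squared-type statistic, without giving formulas. The following is a minimal Python sketch of how they could fit together, not the authors' exact procedure. It assumes the HD vector of a candidate center is the count of records at each distance 0..m, and that the reference distribution treats the m attributes as independent and uniform over k categories, so that distances follow Binomial(m, 1 - 1/k). The function names, the toy data, and the choice of reference distribution are all illustrative assumptions.

```python
from math import comb


def hamming_distance(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def hd_vector(center, data):
    """Counts of records at Hamming distance 0..m from `center`."""
    m = len(center)
    counts = [0] * (m + 1)
    for row in data:
        counts[hamming_distance(center, row)] += 1
    return counts


def expected_hd_vector(n, m, k):
    """Expected counts under the ASSUMED no-cluster reference:
    m independent attributes, each uniform over k categories, so the
    distance to a fixed center is Binomial(m, 1 - 1/k)."""
    p = 1.0 - 1.0 / k
    return [n * comb(m, d) * p**d * (1 - p) ** (m - d) for d in range(m + 1)]


def chi_squared_type_statistic(observed, expected):
    """Pearson-type sum of (O - E)^2 / E over the distance values."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical toy data: 5 records, 3 binary attributes.
data = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("a", "x", "p"),
    ("b", "y", "q"),
    ("a", "y", "p"),
]
center = ("a", "x", "p")  # candidate cluster center

observed = hd_vector(center, data)
expected = expected_hd_vector(n=len(data), m=3, k=2)
stat = chi_squared_type_statistic(observed, expected)
print(observed, stat)
```

In this sketch, a larger statistic means the distances from the candidate center depart further from the assumed no-cluster reference. A sequential scheme such as the one the abstract describes would then extract the records near the chosen center and repeat on the remaining data. The rule for deciding how many records to extract is not stated in the abstract and is not reproduced here.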


Bibliographic record

Source: Journal of the American Statistical Association, 2006, No. 473, pp. 355-367 (13 pages)
Authors: Peng Zhang; Xiaogang Wang; Peter X.-K. Song
Affiliation: Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Original format: PDF
Language: English
Chinese Library Classification: Statistics
Keywords: AutoClass; categorical data; clustering; computational complexity; distance vector; Hamming distance; K-modes; modified chi-squared statistic
Date added to database: 2022-08-18 02:30:09

