
Clustering Large Categorical Data


Abstract

Clustering methods often come down to optimizing a numeric criterion defined from a distance or a dissimilarity measure. This problem can often be shown to be equivalent to estimating the parameters of a probabilistic model under the classification likelihood approach. For instance, the inertia criterion optimized by the k-means algorithm corresponds to the hypothesis that the population arises from a Gaussian mixture. In this paper, we propose a mixture model adapted to categorical data. Using the classification likelihood approach, we develop the Classification EM algorithm (CEM) to estimate the parameters of the mixture model. With our probabilistic model, the data are left in their original form, and the estimated parameters readily indicate the characteristics of the clusters. This probabilistic approach gives an interpretation of the criterion optimized by the k-modes algorithm, an extension of k-means to categorical attributes, and allows us to study the behavior of this algorithm.
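
To make the criterion mentioned in the abstract concrete, here is a minimal sketch of the k-modes iteration in its usual simple-matching formulation: each record is assigned to the nearest mode, each mode is then recomputed as the most frequent category per attribute, and the criterion is the total number of attribute mismatches between records and the mode of their cluster. This is not the paper's CEM implementation. The function names (`mismatch`, `k_modes`), the toy records, and the choice of initial modes are all assumptions made for illustration.

```python
from collections import Counter


def mismatch(x, mode):
    """Simple-matching dissimilarity: number of attributes where x differs from mode."""
    return sum(a != b for a, b in zip(x, mode))


def k_modes(rows, init_modes, max_iter=100):
    """Illustrative k-modes loop (hypothetical helper, not the paper's code).

    rows       -- list of equal-length tuples of categorical values
    init_modes -- list of initial modes, one tuple per cluster
    Returns (labels, modes, cost), where cost is the total mismatch count.
    """
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode; ties go to the lowest cluster index.
        new_labels = [
            min(range(len(modes)), key=lambda k: mismatch(x, modes[k]))
            for x in rows
        ]
        if new_labels == labels:
            break  # partition is stable, so the modes would not change either
        labels = new_labels
        # Update step: per attribute, take the most frequent category in the cluster.
        for k in range(len(modes)):
            members = [x for x, z in zip(rows, labels) if z == k]
            if members:  # an empty cluster keeps its previous mode
                modes[k] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    cost = sum(mismatch(x, modes[z]) for x, z in zip(rows, labels))
    return labels, modes, cost


# Toy data (assumed for illustration): three categorical attributes per record.
rows = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
labels, modes, cost = k_modes(rows, init_modes=[rows[0], rows[3]])
print(labels, modes, cost)
```

The two steps mirror the assignment and centroid-update steps of k-means, with the mode playing the role of the mean and the mismatch count playing the role of squared distance. As with k-means, the result depends on the initial modes, so in practice several initializations would typically be compared.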
