Source: Journal of Machine Learning Research

Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

Abstract

Several large scale data mining applications, such as text categorization and gene expression analysis, involve high-dimensional data that is also inherently directional in nature. Often such data is L2-normalized so that it lies on the surface of a unit hypersphere. Popular models such as (mixtures of) multivariate Gaussians are inadequate for characterizing such data. This paper proposes a generative mixture-model approach to clustering directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. In particular, we derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the mean and concentration parameters of this mixture. Numerical estimation of the concentration parameters is non-trivial in high dimensions since it involves functional inversion of ratios of Bessel functions. We also formulate two clustering algorithms corresponding to the variants of EM that we derive. Our approach provides a theoretical basis for the use of cosine similarity that has been widely employed by the information retrieval community, and obtains the spherical kmeans algorithm (kmeans with cosine similarity) as a special case of both variants. Empirical results on clustering of high-dimensional text and gene-expression data based on a mixture of vMF distributions show that the ability to estimate the concentration parameter for each vMF component, which is not present in existing approaches, yields superior results, especially for difficult clustering tasks in high-dimensional spaces.
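As a rough illustration of the ideas in the abstract, the sketch below implements spherical kmeans (kmeans with cosine similarity on L2-normalized data) and then estimates one concentration value per cluster. Everything here is an assumption made for illustration: the function names (`normalize_rows`, `spherical_kmeans`, `estimate_kappa`, `cluster_kappas`) are hypothetical, the assignments are hard rather than the EM variants the paper derives, and the concentration estimate uses a common closed-form approximation, kappa ≈ r̄(d − r̄²)/(1 − r̄²), where r̄ is the mean resultant length of a cluster and d the dimension. The abstract itself does not state which estimator is used, so this should not be read as the authors' exact procedure.

```python
import numpy as np


def normalize_rows(X):
    """Project each row onto the unit hypersphere (L2 normalization)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def estimate_kappa(r_bar, d):
    """Approximate vMF concentration from the mean resultant length r_bar.

    Assumed closed-form approximation; it sidesteps inverting the ratio of
    Bessel functions exactly.
    """
    r_bar = min(r_bar, 1.0 - 1e-9)  # avoid division by zero for r_bar == 1
    return r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Hard-assignment clustering with cosine similarity."""
    rng = np.random.default_rng(seed)
    X = normalize_rows(np.asarray(X, dtype=float))
    # Initialize centers from k distinct data points (fancy indexing copies).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # On unit vectors the dot product equals the cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.sum(axis=0)
        centers = normalize_rows(centers)
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers


def cluster_kappas(X, labels, k):
    """Per-cluster concentration estimates (NaN for empty clusters)."""
    X = normalize_rows(np.asarray(X, dtype=float))
    d = X.shape[1]
    kappas = np.full(k, np.nan)
    for j in range(k):
        members = X[labels == j]
        if len(members) > 0:
            r_bar = np.linalg.norm(members.sum(axis=0)) / len(members)
            kappas[j] = estimate_kappa(r_bar, d)
    return kappas
```

Calling `spherical_kmeans(X, k)` followed by `cluster_kappas(X, labels, k)` gives a mean direction and a concentration per cluster. Tighter clusters have r̄ closer to 1 and therefore a larger estimated concentration, which is the per-component quantity the abstract credits for the improved results.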