首页> 外文期刊>JMLR: Workshop and Conference Proceedings >Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time
【24h】

Towards Clustering High-dimensional Gaussian Mixture Clouds in Linear Running Time

机译:在线性运行时间内实现高维高斯混合云的聚类

获取原文
           

摘要

Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, where clustering is given as a byproduct. These methods have focused mostly on improving separation bounds for different mixture classes, and doing so in polynomial time and sample complexity. Less emphasis has been given to aligning these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $mathbb{R}^p$ in time that is linear in $p$ and $n$, and sample complexity that is independent of $p$. Our analysis suggests that for sufficiently separated Gaussians after $o(log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(ln p)$ when $gammaleq csqrt{ln{ln{p}}}$ and $gamma=Q^{-1}(e)$ is the separation of the Gaussians with $Q$ as the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike the methods that are based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.
机译:高斯分布的混合聚类是一个基本且具有挑战性的问题。有关学习高斯混合模型的最新理论工作主要集中在估计混合参数上,其中聚类作为副产品给出。这些方法主要集中在改进不同混合物类别的分离界限上,并在多项式时间和样品复杂度方面做到这一点。很少强调使这些算法适应大数据的挑战。在本文中,我们集中于将来自$ mathbb {R} ^ p $中由$ c $分隔的高斯分布的任意混合中的$ n $样本聚类,并且在$ p $和$ n $中呈线性,并且样本复杂度高独立于$ p $。我们的分析表明,对于在$ o( log {p})$个随机投影之后经过充分分离的高斯,可以找到一个好的方向,该方向会产生较小的聚类误差。具体来说,对于用户指定的错误$ e $,当$ gamma leq c sqrt { ln { ln {p}}时,此类投影的预期数量较小,并受$ o( ln p)$限制} $和$ gamma = Q ^ {-1}(e)$是高斯项的分离,其中$ Q $是正态分布的尾部分布函数。因此,在$ o( ln {p})O(np)$时,算法的预期总运行时间以$ n $为线性,以$ p $为准线性,并且样本复杂度与$ p $独立。与基于$ k $ -means的方法不同,我们的分析适用于任何混合类别(球形或非球形)。最后,还提供了对$ k> 2 $组件的扩展。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号