Clustering mixtures of Gaussian distributions is a fundamental and challenging problem. State-of-the-art theoretical work on learning Gaussian mixture models has mostly focused on estimating the mixture parameters, where clustering is given as a byproduct. These methods have focused mostly on improving separation bounds for different mixture classes, and doing so in polynomial time and sample complexity. Less emphasis has been given to aligning these algorithms to the challenges of big data. In this paper, we focus on clustering $n$ samples from an arbitrary mixture of $c$-separated Gaussians in $mathbb{R}^p$ in time that is linear in $p$ and $n$, and sample complexity that is independent of $p$. Our analysis suggests that for sufficiently separated Gaussians after $o(log{p})$ random projections a good direction is found that yields a small clustering error. Specifically, for a user-specified error $e$, the expected number of such projections is small and bounded by $o(ln p)$ when $gammaleq csqrt{ln{ln{p}}}$ and $gamma=Q^{-1}(e)$ is the separation of the Gaussians with $Q$ as the tail distribution function of the normal distribution. Consequently, the expected overall running time of the algorithm is linear in $n$ and quasi-linear in $p$ at $o(ln{p})O(np)$, and the sample complexity is independent of $p$. Unlike the methods that are based on $k$-means, our analysis is applicable to any mixture class (spherical or non-spherical). Finally, an extension to $k>2$ components is also provided.
展开▼
机译:高斯分布的混合聚类是一个基本且具有挑战性的问题。有关学习高斯混合模型的最新理论工作主要集中在估计混合参数上,其中聚类作为副产品给出。这些方法主要集中在改进不同混合物类别的分离界限上,并在多项式时间和样品复杂度方面做到这一点。很少强调使这些算法适应大数据的挑战。在本文中,我们集中于将来自$ mathbb {R} ^ p $中由$ c $分隔的高斯分布的任意混合中的$ n $样本聚类,并且在$ p $和$ n $中呈线性,并且样本复杂度高独立于$ p $。我们的分析表明,对于在$ o( log {p})$个随机投影之后经过充分分离的高斯,可以找到一个好的方向,该方向会产生较小的聚类误差。具体来说,对于用户指定的错误$ e $,当$ gamma leq c sqrt { ln { ln {p}}时,此类投影的预期数量较小,并受$ o( ln p)$限制} $和$ gamma = Q ^ {-1}(e)$是高斯项的分离,其中$ Q $是正态分布的尾部分布函数。因此,在$ o( ln {p})O(np)$时,算法的预期总运行时间以$ n $为线性,以$ p $为准线性,并且样本复杂度与$ p $独立。与基于$ k $ -means的方法不同,我们的分析适用于任何混合类别(球形或非球形)。最后,还提供了对$ k> 2 $组件的扩展。
展开▼