IEEE International Conference on Soft Computing and Machine Intelligence

Learning the $k$ in $k$-means via the Camp-Meidell Inequality



Abstract

Determining $k$, the number of clusters, is a key challenge in $k$-means clustering. We propose an approach, named Automatic $k$-means ($k$-means-auto), which judges a $k$ value to be right if at least 88.89% of $\pmb{D}$ is less than twice its standard deviation, where $\pmb{D}$ is a vector of length $\pmb{N}$ (the number of points) comprising the distance of each point to its cluster centroid. The bound is derived from the Camp-Meidell inequality for unimodal and symmetric distributions. First, we show that $k$-means' equal-variance Gaussian cluster assumption induces a Gaussian fit for $\pmb{D}$. Thus, if $\pmb{D}$ is Gaussian, then all clusters are Gaussian, and the underlying $k$ value is appropriate. We chose to test $\pmb{D}$ for unimodality and symmetry instead of Gaussian fit, as a means of relaxation, since clusters in real data are hardly ever perfectly Gaussian. The proposed approach is fast, since $\pmb{D}$ does not have to be computed afresh: it is already available as a by-product of the clustering process. On 10 well-known datasets, $k$-means-auto finds the true $\pmb{k}$ in 6 cases and comes very close ($\pm 1$ cluster) in 3 other cases. Compared with two well-known methods, the gap statistic and silhouette, it outperforms the former and competes closely (statistically insignificant difference at 95% confidence) with the latter. Across all cases, the running-time ranges for $k$-means-auto, gap and silhouette are 0.02s-0.44s (a factor of 22), 0.31s-13.63s (a factor of 108) and 0.02s-70.78s (a factor of 3539) respectively. Thus, $k$-means-auto surpasses the other two methods in efficiency.
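The acceptance test described in the abstract can be sketched as follows. This is a minimal, illustrative reconstruction from the abstract alone, not the authors' code: the function names are ours, and centering the Camp-Meidell test on the mean of $\pmb{D}$ is our reading of the criterion (the inequality itself is stated for deviations about the center of a unimodal, symmetric distribution).

```python
import numpy as np

def camp_meidell_ok(D, lam=2.0):
    # Camp-Meidell inequality: for a unimodal, symmetric distribution,
    # at most 4 / (9 * lam^2) of the mass lies lam or more standard
    # deviations from the center. For lam = 2 that is 1/9, so at least
    # 1 - 1/9 = 88.89% of D must fall within 2 standard deviations.
    bound = 1.0 - 4.0 / (9.0 * lam ** 2)          # 0.8889 for lam = 2
    within = np.abs(D - D.mean()) < lam * D.std()
    return within.mean() >= bound

def distance_vector(X, centers, labels):
    # D: each point's distance to its assigned centroid -- available
    # as a by-product of any k-means run, so no extra pass over the
    # data is needed beyond the clustering itself.
    return np.linalg.norm(X - centers[labels], axis=1)
```

A $k$ search would then run $k$-means for $k = 1, 2, \ldots$ and accept the first $k$ whose `distance_vector` passes `camp_meidell_ok`; under the abstract's argument, a passing $\pmb{D}$ is consistent with equal-variance Gaussian clusters and hence with an appropriate $k$.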
