
Robust $k$-means++


Abstract

A good seeding or initialization of cluster centers for the $k$-means method is important from both theoretical and practical standpoints. The $k$-means objective is inherently non-robust and sensitive to outliers. A popular seeding such as $k$-means++ [3], which is more likely to pick outliers in the worst case, may compound this drawback, thereby affecting the quality of clustering on noisy data. For any $0 < \delta \leq 1$, we show that using a mixture of $D^{2}$-sampling [3] and uniform sampling, we can pick $O(k/\delta)$ candidate centers with the following guarantee: they contain some $k$ centers that give an $O(1)$-approximation to the optimal robust $k$-means solution while discarding at most $\delta n$ more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest $\beta n$ points as outliers, our solution discards its $(\beta + \delta) n$ points as outliers. The constant factor in our $O(1)$-approximation does not depend on $\delta$. This is an improvement over previous results for $k$-means with outliers based on LP relaxation and rounding [7] and local search [17]. The $O(k/\delta)$-sized subset can be found in time $O(ndk)$. Our \emph{robust} $k$-means++ is also easily amenable to scalable, faster, parallel implementations of $k$-means++ [5]. Our empirical results compare the above \emph{robust} variant of $k$-means++ with the usual $k$-means++, uniform random seeding, threshold $k$-means++ [6], and local search on real-world and synthetic data.
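The core idea in the abstract is a seeding rule that mixes $D^{2}$-sampling (which chases far-away points and hence outliers) with uniform sampling (which is insensitive to outliers), drawing $O(k/\delta)$ candidates in total. The following is a minimal sketch of that mixture, not the paper's algorithm: the 50/50 mixture weight, the candidate count $\lceil k/\delta \rceil$, and the function name are illustrative assumptions.

```python
import numpy as np

def robust_kmeans_pp_seeding(X, k, delta, rng=None):
    """Hedged sketch of mixture seeding: draw O(k/delta) candidate
    centers, each from a 50/50 mixture of D^2-sampling and uniform
    sampling. Mixture weight and candidate count are assumptions,
    not the constants from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    m = int(np.ceil(k / delta))  # O(k/delta) candidate centers
    # first center: uniform at random, as in standard k-means++
    first = rng.integers(n)
    centers = [X[first]]
    # squared distance of each point to its nearest chosen center
    d2 = np.sum((X - X[first]) ** 2, axis=1)
    for _ in range(m - 1):
        if rng.random() < 0.5 and d2.sum() > 0:
            # D^2-sampling: probability proportional to squared distance
            idx = rng.choice(n, p=d2 / d2.sum())
        else:
            # uniform sampling: hedges against repeatedly picking outliers
            idx = rng.integers(n)
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Per the abstract, a good set of $k$ centers can then be selected from these candidates; the sketch above covers only the candidate-generation step, which runs in $O(ndk/\delta)$ time for this naive implementation.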
