Published in: Uncertainty in Artificial Intelligence

Optimal Time Bounds for Approximate Clustering



Abstract

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution whose cost is at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on the running time of any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say, 1/100) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
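As a loose illustration of the ideas in the abstract, the sketch below shows the k-median objective and a successive-sampling-flavored summary step: repeatedly draw a small uniform sample, then discard the half of the remaining points closest to that sample, so that roughly O(k log(n/k)) points survive as a summary. This is only a sketch of the flavor of the technique, not the paper's algorithm; the sample size `3k` and the discard-half rule are illustrative assumptions.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def kmedian_cost(points, centers):
    """k-median objective: total distance from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def successive_sample(points, k, sample_factor=3):
    """Successive-sampling-style summary (illustrative constants, not the paper's).

    Each round draws a uniform sample of size ~3k, keeps it in the summary,
    and discards the half of the remaining points nearest to the sample.
    The number of rounds is about log(n/k), so the summary has roughly
    O(k log(n/k)) points.
    """
    target = max(1, sample_factor * k)
    remaining = list(points)
    summary = []
    while len(remaining) > target:
        sample = random.sample(remaining, target)
        summary.extend(sample)
        # Sort remaining points by distance to the nearest sampled point.
        by_dist = sorted(remaining, key=lambda p: min(dist(p, s) for s in sample))
        # Keep only the farther half for the next round.
        remaining = by_dist[len(by_dist) // 2:]
    summary.extend(remaining)
    return summary
```

The summary set can then be fed to any (slower) constant-factor k-median subroutine in place of the full input; discarding the near half each round is what keeps the total work linear in n.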


