...
首页> 外文期刊>Theory of computing systems >Small Space Representations for Metric Min-sum k-Clustering and Their Applications
【24h】

Small Space Representations for Metric Min-sum k-Clustering and Their Applications

机译:度量最小和k聚类的小空间表示形式及其应用

获取原文
获取原文并翻译 | 示例
           

摘要

The min-sum k-clustering problem is to partition a metric space (P, d) into k clusters C_1,…, C_k is contained in P such that ∑_(I=1)~ki ∑_(p,q∈C_I) d(p,q) is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. With our construction of coresets we obtain two main algorithmic results.rnThe first result is a sublinear-time (4 + ∈)-approximation algorithm for the min-sum ^-clustering problem in metric spaces. The running time of this algorithm is O(n) for any constant k and ∈, and it is o(n~2) for all k = o(log n/loglog n). Since the full description size of the input is Θ(n~2), this is sublinear in the input size. The fastest previously known o(logn)-factor approximation algorithm for k > 2 achieved a running time of Ω (n~k), and no non-trivial o(n~2)-time algorithm was known before.rnOur second result is the first pass-efficient data streaming algorithm for min-sum k-clustering in the distance oracle model, I.e., an algorithm that uses poly(logn, k) space and makes 2 passes over the input point set, which arrives in form of a data stream in arbitrary order. It computes an implicit representation of a clustering of (P,d) with cost at most a constant factor larger than that of an optimal partition. Using one further pass, we can assign each point to its corresponding cluster.rnTo develop the coresets, we introduce the concept of α-preserving metric embed-dings. Such an embedding satisfies properties that the distance between any pair of points does not decrease and the cost of an optimal solution for the considered problem on input (P, d') is within a constant factor of the optimal solution on input (P, d). In other words, the goal is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of α with respect to a given problem. We believe that this concept is an interesting generalization of coresets.
机译:最小和k聚类问题是将度量空间(P,d)划分为k个聚类C_1,…,C_k包含在P中,使得∑_(I = 1)〜ki ∑_(p,q∈C_I )d(p,q)被最小化。我们展示了针对此问题的核心组的第一个有效构造。我们的核心集构建基于新的自适应采样算法。通过构造核集,我们获得了两个主要的算法结果。第一个结果是度量空间中最小和^聚类问题的亚线性时间(4 +ε)-近似算法。对于任何常数k和∈,该算法的运行时间为O(n),对于所有k = o(log n / loglog n),它的运行时间为o(n〜2)。由于输入的完整描述大小为Θ(n〜2),因此输入大小为次线性。对于k> 2而言,先前已知的最快的o(logn)因子逼近算法实现了Ω(n〜k)的运行时间,并且以前没有非平凡的o(n〜2)-时间算法是未知的。距离预言模型中用于最小总和k聚类的第一种传递效率高的数据流算法,即一种使用poly(logn,k)空间并使输入点集经过2次传递的算法,其形式为a数据流以任意顺序。它计算(P,d)聚类的隐式表示,其成本最多为比最佳分区大的恒定因子。使用另一遍,我们可以将每个点分配给其对应的簇。为了开发核心集,我们引入了保α度量嵌入的概念。这样的嵌入满足以下特性:任意一对点之间的距离都不会减小,并且针对输入上考虑的问题(P,d')的最优解的成本在输入(P,d)的最优解的常数因子之内)。换句话说,目标是找到一个嵌入到(结构上更简单)度量空间中的度量,该度量相对于给定问题将原始度量近似为因子α。我们认为,此概念是对核心集的有趣概括。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号