首页> 外文会议>Annual European symposium on algorithms >BICO: BIRCH Meets Coresets for k-Means Clustering
【24h】

BICO: BIRCH Meets Coresets for k-Means Clustering

机译:BICO:BIRCH满足k-Means聚类的核心集

获取原文

摘要

We design a data stream algorithm for the k-means problem, called BICO, that combines the data structure of the SIGMOD Test of Time award winning algorithm BIRCH with the theoretical concept of coresets for clustering problems. The k-means problem asks for a set C of k centers minimizing the sum of the squared distances from every point in a set P to its nearest center in C. In a data stream, the points arrive one by one in arbitrary order and there is limited storage space. BICO computes high quality solutions in a time short in practice. First, BICO computes a summary S of the data with a provable quality guarantee: For every center set C, S has the same cost as P up to a (1 + ε)-factor, i. e., S is a coreset. Then, it runs k-means++ on S. We compare BICO experimentally with popular and very fast heuristics (BIRCH, MacQueen [24]) and with approximation algorithms (Stream-KM++ , StreamLS ) with the best known quality guarantees. We achieve the same quality as the approximation algorithms mentioned with a much shorter running time, and we get much better solutions than the heuristics at the cost of only a moderate increase in running time.
机译:我们设计了一种针对k均值问题的数据流算法,称为BICO,该算法将SIGMOD时间测试获奖算法BIRCH的数据结构与用于聚类问题的核心集的理论概念相结合。 k均值问题要求由k个中心组成的集合C最小化从集合P中的每个点到它在C中最近的中心的平方距离之和。在数据流中,这些点以任意顺序一个到一个地到达是有限的存储空间。在实践中,BICO可以在短时间内计算出高质量的解决方案。首先,BICO计算出具有可证明质量保证的数据摘要S:对于每个中心集C,S的成本与P相同,但高达(1 +ε)因子,即。例如,S是一个核心集。然后,它在S上运行k-means ++。我们通过实验将BICO与流行且非常快的启发式算法(BIRCH,MacQueen [24])和近似算法(Stream-KM ++,StreamLS)进行比较,该算法具有最著名的质量保证。我们以更短的运行时间实现了与上述近似算法相同的质量,并且以仅适度增加运行时间为代价,获得了比启发式方法更好的解决方案。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号