We present new algorithms for the k-means clustering problem. They use a new kd-tree traversal algorithm, supplemented with a novel pruning test, to achieve cost sublinear in both the number of data points and the number of centers. The nodes of the kd-tree are decorated with 'cached sufficient statistics' as in (3). An analysis of the geometry of the current cluster centers then greatly reduces the work needed to update the centers. Our algorithms behave exactly like the traditional k-means algorithm, and proofs of correctness are included. The kd-tree can also be used to initialize the k-means starting centers efficiently. Our algorithms extend easily to fast computation of the error of a given cluster assignment, regardless of the method by which those clusters were obtained. We also show how to use them in a setting that allows approximate clustering results, with the benefit of faster running times.
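To make the idea concrete, here is a minimal sketch (not the paper's implementation) of a kd-tree whose nodes cache sufficient statistics: the point count and the vector sum of the points below the node. During an assignment pass, a conservative ownership test is applied at each node: if one center is provably closer than every other center to the node's entire bounding box, the node's cached statistics are absorbed by that center in O(1), without visiting the individual points. The pruning test below uses simple min/max box-to-center distances and is cruder than the geometric test in the paper; the names `Node` and `update` and the leaf size of 16 are illustrative assumptions.

```python
import numpy as np

class Node:
    """kd-tree node decorated with cached sufficient statistics."""
    def __init__(self, points):
        self.n = len(points)                  # count of points below this node
        self.s = points.sum(axis=0)           # vector sum of those points
        self.lo = points.min(axis=0)          # bounding hyperrectangle
        self.hi = points.max(axis=0)
        self.left = self.right = self.points = None
        if self.n <= 16:                      # leaf: keep the raw points
            self.points = points
        else:                                 # split on the widest dimension
            d = np.argmax(self.hi - self.lo)
            order = points[:, d].argsort()
            mid = self.n // 2
            self.left = Node(points[order[:mid]])
            self.right = Node(points[order[mid:]])

def min_dist2(node, c):
    """Squared distance from center c to the nearest point of the box."""
    d = np.maximum(node.lo - c, 0) + np.maximum(c - node.hi, 0)
    return (d * d).sum()

def max_dist2(node, c):
    """Squared distance from center c to the farthest corner of the box."""
    d = np.maximum(np.abs(c - node.lo), np.abs(c - node.hi))
    return (d * d).sum()

def update(node, centers, sums, counts):
    """One k-means assignment pass, accumulating per-center sums and counts."""
    mins = np.array([min_dist2(node, c) for c in centers])
    maxs = np.array([max_dist2(node, c) for c in centers])
    owner = mins.argmin()
    others = np.delete(mins, owner)
    # Pruning test: if the owner's farthest distance to the box beats every
    # other center's nearest distance, it owns every point inside, so the
    # cached statistics absorb the whole node without a scan.
    if others.size == 0 or maxs[owner] <= others.min():
        sums[owner] += node.s
        counts[owner] += node.n
    elif node.points is not None:             # leaf: fall back to the points
        for p in node.points:
            j = ((centers - p) ** 2).sum(axis=1).argmin()
            sums[j] += p
            counts[j] += 1
    else:
        update(node.left, centers, sums, counts)
        update(node.right, centers, sums, counts)

# Usage: ten Lloyd iterations over 10,000 random 2-D points, k = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
centers = X[rng.choice(len(X), size=3, replace=False)].copy()
root = Node(X)
for _ in range(10):
    sums = np.zeros_like(centers)
    counts = np.zeros(len(centers))
    update(root, centers, sums, counts)
    centers = sums / np.maximum(counts, 1)[:, None]
```

Because the pruning test is conservative, the pass assigns every point to its true nearest center, so the iterations match plain k-means exactly; only the amount of work per pass changes.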