Journal: Expert Systems with Applications

A k-means based co-clustering (kCC) algorithm for sparse, high dimensional data

Abstract

The k-means algorithm is a widely used clustering method that starts from an initial partitioning of the data and then iteratively converges towards a local solution by reducing the Sum of Squared Errors (SSE). It is known to suffer from the cluster center initialization problem, and the iterative step simply (re-)labels the data points based on the initial partition. Most improvements to k-means proposed in the literature focus on the initialization step alone and make no attempt to guide the iterative convergence by exploiting statistical information from the data. Using higher-order statistics (such as paths from random walks in a graph) and the duality in the data (as in co-clustering) are, for instance, known ways to improve clustering results. What is unique and significant in our proposed approach is that we embed these concepts into the k-means algorithm itself, rather than using them only as an external distance measure, and present a unified framework called the k-means based co-clustering (kCC) algorithm. The initialization step has been modified to include multiple points to represent each cluster center, such that points within a cluster are close together but far from points representing other clusters. Moreover, neighborhood walk statistics are proposed as a semantic similarity technique for both cluster assignment and center re-estimation in the iterative process. The effectiveness of the combined approach is evaluated on several standard data sets. Our results show that kCC performs better than both baseline k-means and other state-of-the-art improvements. (C) 2018 Elsevier Ltd. All rights reserved.
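
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of plain k-means (Lloyd-style iteration), not of kCC itself. It assumes dense points stored as Python lists and random initialization from the data; the function names (`sq_dist`, `assign`, `update`, `kmeans`) and the `history` list are illustrative choices, not taken from the paper. It shows the two points the abstract makes about the baseline: the SSE never increases from one iteration to the next, and the result depends on the initial centers.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def sse(points, centers, labels):
    """Sum of Squared Errors of the current partition."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def update(points, labels, centers):
    """Move each center to the mean of its points (empty clusters keep their center)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(list(old))
    return new_centers


def kmeans(points, k, max_iter=100, seed=0):
    """Baseline k-means; returns centers, labels, and the SSE after each step."""
    rng = random.Random(seed)
    # Assumption: centers are initialized by sampling k data points at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = assign(points, centers)
    history = []
    for _ in range(max_iter):
        history.append(sse(points, centers, labels))
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:  # partition unchanged: a local solution
            break
        labels = new_labels
    history.append(sse(points, centers, labels))
    return centers, labels, history


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels, history = kmeans(data, k=2)
    print(labels, history)
```

According to the abstract, kCC changes both halves of this loop: each cluster is represented by multiple points at initialization, and the Euclidean distance used here for assignment and center re-estimation is replaced by a similarity based on neighborhood walk statistics. Neither modification is implemented in this sketch.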