Journal: Pattern Recognition Letters

Semi-supervised clustering of large data sets with kernel methods

Abstract

Labelling real-world data sets is a difficult problem. Often the human expert is unsure about the class label of a specific sample point or, in the case of very large data sets, labelling every point manually is impractical. In semi-supervised clustering, the available sample labels serve as external information that is used to find better-matching cluster partitions. Furthermore, kernel-based clustering methods are able to partition the data with nonlinear boundaries in feature space. While these methods improve the clustering results, their computation time is quadratic in the number of samples. In this paper, we propose a meta-algorithm that processes small subsets of a large data set one at a time: it clusters each subset using the sample labels, then merges the points closest to the resulting prototypes with the next subset, until the whole data set has been processed. Its computation time is linear in the size of the data set. The error function that this meta-algorithm minimizes is presented. Although we applied this meta-algorithm to Kernel Fuzzy C-Means, Relational Neural Gas and Kernel K-Means, it can be applied to a broad range of kernel-based clustering methods. The proposed method has been empirically evaluated on two real-world benchmark data sets.
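
The chunk-wise procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the labels enter the error function or how many points are carried between subsets, so the sketch makes its own assumptions: an RBF kernel, plain Kernel K-Means as the base method, labelled points pinned to the cluster whose index equals their label (a simple seeded variant, with labels in 0..k-1 and -1 meaning unlabelled), and a fixed number of carried points per cluster. All function and parameter names (`rbf_kernel`, `kernel_kmeans`, `chunked_clustering`, `chunk_size`, `keep_per_cluster`, `gamma`) are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def _distances(K, assign, k, mask):
    # Squared feature-space distance of every point to each cluster mean,
    # where the mean is taken over the members selected by `mask`:
    # ||phi(x) - m_c||^2 = K_xx - 2 * mean_j K_xj + mean_jl K_jl
    dist = np.full((K.shape[0], k), np.inf)
    for c in range(k):
        members = np.flatnonzero((assign == c) & mask)
        if members.size:
            dist[:, c] = (np.diag(K) - 2.0 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
    return dist

def kernel_kmeans(X, k, labels, gamma=1.0, n_iter=20, seed=0):
    # Kernel K-Means on one chunk. Assumption: labels[i] >= 0 pins point i
    # to that cluster; labels[i] == -1 marks an unlabelled point.
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, X, gamma)  # quadratic only in the chunk size
    pinned = labels >= 0
    assign = rng.integers(0, k, size=len(X))
    assign[pinned] = labels[pinned]
    if pinned.any():
        # Start from the labelled points only, so cluster c matches label c.
        assign = _distances(K, assign, k, pinned).argmin(axis=1)
        assign[pinned] = labels[pinned]
    everyone = np.ones(len(X), dtype=bool)
    for _ in range(n_iter):
        dist = _distances(K, assign, k, everyone)
        new_assign = dist.argmin(axis=1)
        new_assign[pinned] = labels[pinned]
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign, dist

def chunked_clustering(X, labels, k, chunk_size=200, keep_per_cluster=10,
                       gamma=1.0, seed=0):
    # Process the data chunk by chunk; after each chunk, carry the points
    # nearest to each prototype into the next chunk as pinned points.
    n = len(X)
    final = np.full(n, -1)
    carry_idx = np.array([], dtype=int)
    carry_lab = np.array([], dtype=int)
    for start in range(0, n, chunk_size):
        new_idx = np.arange(start, min(start + chunk_size, n))
        idx = np.concatenate([carry_idx, new_idx])
        lab = np.concatenate([carry_lab, labels[new_idx]])
        assign, dist = kernel_kmeans(X[idx], k, lab, gamma=gamma, seed=seed)
        final[idx] = assign
        keep = []
        for c in range(k):
            members = np.flatnonzero(assign == c)
            order = np.argsort(dist[members, c])[:keep_per_cluster]
            keep.extend(members[order])
        keep = np.array(keep, dtype=int)
        carry_idx, carry_lab = idx[keep], assign[keep]
    return final
```

Each call to `kernel_kmeans` builds a kernel matrix over at most `chunk_size + k * keep_per_cluster` points, and the number of chunks grows in proportion to the data set size, so the total cost of this sketch is linear in the number of samples for fixed chunk parameters. A cluster with no labelled or carried point in a chunk starts empty in that chunk, which a more careful version would handle explicitly.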
