Conference: International Joint Conference on Neural Networks

Speeding-up the prototype based kernel k-means clustering method for large data sets


Abstract

Kernel k-means is a non-linear extension of the k-means clustering method that performs well at identifying non-isotropic and linearly inseparable clusters. However, its space and time requirements are expensive, with O(n^2) complexity. Because it relies on large in-memory computations, the method is unsuitable for large data sets. Recently, a simple prototype-based hybrid approach was proposed to speed up the kernel k-means method for large data sets [1]. The time complexity of this method is O(n + p^2), where p is the number of prototypes. Each prototype is a representative pattern of a grouplet of size (threshold) τ. The time complexity therefore depends on p, which in turn depends on the clustering threshold. Increasing the threshold value can decrease the number of prototypes p, but the quality of the clustering result might suffer. Hence, fixing an appropriate value of the threshold is the major challenge in this approach. This paper presents a solution to this problem by allowing τ to vary depending on the location of the grouplet in the space. Intuitively, if the grouplet is close to a cluster center (and away from the others), its size can be large, but if it lies somewhere between two cluster centers, its size should be small. It is experimentally shown that this reduces the clustering time and also increases the clustering accuracy. The presented method is suitable for large data sets, such as those in data mining.
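
The fixed-threshold baseline that the abstract builds on can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how grouplets are formed, so the single-pass leaders-style grouping here is an assumption, as are the Gaussian (RBF) kernel, the gamma value, the weighting of each prototype by its grouplet size, and the naive deterministic initialisation. All function names are hypothetical. The sketch shows only the trade-off the abstract describes: a larger τ gives fewer prototypes p, and the kernel matrix is built over the p prototypes rather than over all n points.

```python
import numpy as np


def leaders(X, tau):
    """Assumed grouping step: a point joins the first prototype within
    distance tau, otherwise it becomes a new prototype (single pass)."""
    protos, counts = [], []
    assign = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        for j, p in enumerate(protos):
            if np.linalg.norm(x - p) <= tau:
                counts[j] += 1
                assign[i] = j
                break
        else:  # no prototype close enough: x starts a new grouplet
            protos.append(x)
            counts.append(1)
            assign[i] = len(protos) - 1
    return np.array(protos), np.array(counts, dtype=float), assign


def rbf_kernel(P, gamma):
    """p x p Gaussian kernel matrix over the prototypes only."""
    sq = np.sum(P ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * P @ P.T, 0.0)
    return np.exp(-gamma * d2)


def weighted_kernel_kmeans(K, w, k, n_iter=100):
    """Kernel k-means on prototypes, each weighted by its grouplet size w."""
    p = len(w)
    seeds = np.linspace(0, p - 1, k).astype(int)  # naive deterministic init
    labels = K[:, seeds].argmax(axis=1)
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((p, k), np.inf)
        for c in range(k):
            m = labels == c
            s = w[m].sum()
            if s == 0:
                continue  # an empty cluster stays at infinite distance
            # ||phi(x_i) - mu_c||^2 expanded using kernel values only
            cross = K[:, m] @ w[m] / s
            within = w[m] @ K[np.ix_(m, m)] @ w[m] / s ** 2
            dist[:, c] = diag - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


def prototype_kernel_kmeans(X, tau, k, gamma=0.5):
    """Cluster the prototypes, then give each point its prototype's label."""
    protos, counts, assign = leaders(X, tau)
    labels = weighted_kernel_kmeans(rbf_kernel(protos, gamma), counts, k)
    return labels[assign], len(protos)


# Usage: two well-separated grids of points, clustered at two thresholds.
g = np.linspace(0.0, 1.0, 11)
blob = np.array([[a, b] for a in g for b in g])
X = np.vstack([blob, blob + 10.0])
for tau in (0.15, 5.0):
    labels, p = prototype_kernel_kmeans(X, tau, k=2)
    print(f"tau={tau}: {p} prototypes for {len(X)} points")
```

In this sketch τ is a single global value. The paper's contribution, letting τ vary with the grouplet's position relative to the cluster centers, is not implemented here because the abstract does not give the rule.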
