Journal: Applied Soft Computing

An initial seed selection algorithm for k-means clustering of georeferenced data to improve replicability of cluster assignments for mapping application

Abstract

K-means is one of the most widely used clustering algorithms across disciplines, especially for large datasets. However, the method is known to be highly sensitive to the initial seed selection of cluster centers. K-means++ has been proposed to overcome this problem and has been shown to have better accuracy and computational efficiency than k-means. In many clustering problems, though, such as classifying georeferenced data for mapping applications, standardization of the clustering methodology may matter more than any perceived measure of accuracy, especially when the solution is known to be non-unique, as in k-means clustering. Standardization here means replicability: the ability to arrive at the same cluster assignment on every run of the method. Here we propose a simple initial seed selection algorithm for k-means clustering along one attribute that draws initial cluster boundaries along the "deepest valleys", or greatest gaps, in the dataset. It thus incorporates a measure that maximizes the distance between consecutive cluster centers, augmenting the conventional k-means optimization, which minimizes the distance between each cluster center and its cluster members. Unlike existing initialization methods, it introduces no additional parameters or degrees of freedom to the clustering algorithm. The method improves the replicability of cluster assignments by as much as 100% over k-means and k-means++, reducing the variance across runs to virtually zero. Further, the proposed method is more computationally efficient than k-means++ and, in some cases, more accurate.
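
The "greatest gaps" idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a single numeric attribute, sorts the values, cuts the sorted sequence at the k-1 widest gaps between consecutive values, and then runs ordinary Lloyd iterations from the resulting seeds. The function names `gap_based_seeds` and `kmeans_1d` are hypothetical, and taking each segment's mean as its seed is an assumption, since the abstract does not say how seeds are derived from the boundaries.

```python
import numpy as np


def gap_based_seeds(values, k):
    """Deterministic 1-D seeds: split the sorted data at its k-1 widest gaps.

    Assumption (not stated in the abstract): each seed is the mean of its segment.
    """
    x = np.sort(np.asarray(values, dtype=float))
    if not 1 <= k <= x.size:
        raise ValueError("k must be between 1 and the number of points")
    gaps = np.diff(x)  # gaps[i] is the distance between x[i] and x[i + 1]
    # Indices of the k-1 widest gaps; a stable sort resolves ties identically on every run.
    widest = np.argsort(-gaps, kind="stable")[: k - 1]
    cut_points = np.sort(widest) + 1  # first index of each new segment
    segments = np.split(x, cut_points)
    return np.array([segment.mean() for segment in segments])


def assign_labels(x, centers):
    """Assign each point to the index of its nearest center."""
    return np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)


def kmeans_1d(values, seeds, max_iter=100):
    """Plain Lloyd iterations in one dimension, started from the given seeds."""
    x = np.asarray(values, dtype=float)
    centers = np.asarray(seeds, dtype=float).copy()
    for _ in range(max_iter):
        labels = assign_labels(x, centers)
        # Keep the old center if a cluster happens to receive no points.
        new_centers = np.array([
            x[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(centers.size)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign_labels(x, centers), centers
```

Because nothing in this sketch is random, calling `kmeans_1d(data, gap_based_seeds(data, k))` twice on the same data returns the same labels, which is the replicability property the abstract emphasizes. Randomly seeded k-means or k-means++ would need a fixed random seed to behave this way.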
