Journal: Machine Learning and Knowledge Extraction

Optimal Clustering and Cluster Identity in Understanding High-Dimensional Data Spaces with Tightly Distributed Points


Abstract

The sensitivity of the elbow rule in determining an optimal number of clusters in high-dimensional spaces that are characterized by tightly distributed data points is demonstrated. The high-dimensional data samples are not artificially generated; they are taken from a real-world evolutionary many-objective optimization. They comprise Pareto fronts from the last 10 generations of an evolutionary optimization computation with 14 objective functions. The choice to analyze Pareto fronts is strategic: it is intended to benefit the user who needs only one solution from the Pareto set to implement, and for whom a systematic means of reducing the cardinality of solutions is therefore imperative. As such, this manuscript covers clustering the data and identifying the cluster from which to pick the desired solution, highlighting the implementation of the elbow rule and the use of hyper-radial distances for cluster identity. The Calinski-Harabasz statistic was favored for determining the criteria used in the elbow rule because of its robustness. The statistic takes into account both the variance within clusters and the variance between clusters. This exercise also opened an opportunity to revisit the justification for using the highest Calinski-Harabasz criterion to determine the optimal number of clusters for multivariate data. The elbow rule predicted the upper end of the range for the optimal number of clusters, and the highest Calinski-Harabasz criterion method favored a number of clusters at the lower end. Both results are used in a unique way for understanding high-dimensional data, although it remains inconclusive which of the two methods determines the true optimal number of clusters.
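
To make the statistic named in the abstract concrete, the following is a minimal sketch of the standard Calinski-Harabasz criterion: the between-cluster dispersion divided by (k - 1), over the within-cluster dispersion divided by (n - k). It is not the authors' implementation. The function name `calinski_harabasz`, its arguments, and the toy data are assumptions introduced only for illustration, and the cluster labels are taken as given rather than produced by any particular clustering method.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz criterion for an existing labeling.

    Hypothetical helper for illustration: `points` is a sequence of
    equal-length numeric tuples, `labels` assigns one cluster id per point.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("the criterion needs 2 <= k <= n - 1 clusters")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted squared distance of centroids to overall mean
    within = 0.0   # squared distance of points to their own centroid
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data (not from the paper): two tight, well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(toy_points, [0, 0, 1, 1])  # matches the two pairs
bad = calinski_harabasz(toy_points, [0, 1, 0, 1])   # splits each pair
```

On this toy data the labeling that matches the two pairs scores far higher than the one that splits them, which is the behavior the "highest criterion" rule relies on. Comparing the score across candidate values of k, whether by taking the maximum or by looking for an elbow, requires a clustering step that this sketch deliberately leaves out.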
