Machine learning approaches for determining effective seeds for k-means algorithm.

Abstract

In this study, I investigate two-stage clustering procedures (hybrid models) through experiments in simulated environments, where conditions such as collinearity and cluster structure are controlled, and in real-life problems, where they are not. The first hybrid model (NK) integrates a neural network (NN) with the k-means algorithm (KM): the NN screens seeds and passes them to KM. The second hybrid (GK) uses a genetic algorithm (GA) instead of the neural network. Both the NN and the GA used in this study are in their simplest possible forms.

In the simulated data sets, I investigate two questions: how the five clustering approaches (KM, NN, NK, GA, GK) compare in performance, and how five factors (scale, sample size, density, number of clusters, and number of variables) affect them. Density, number of clusters, and dimensionality influence the clustering performance of all five approaches. KM, NK, and GK classify well when all clusters contain a similar number of observations, with NK and GK outperforming KM. NN performs well when one cluster contains more observations than any other. The two hybrid models perform at least as well as KM, even though the simulated environments favor KM: the most crucial information, the true number of clusters, is provided to KM only, and the cluster structures are simple, in that the clusters are well separated, the variances and cluster sizes are uniform, correlations between variables and collinearity problems are negligible, and the observations are normally distributed.

The real-life problems consist of three problems with a known natural cluster structure and one with an unknown natural cluster structure. Overall, GK performs better than KM, while NK is the worst performer among the five approaches. The two machine learning approaches generate better results than KM in an environment that does not favor KM.

GK proves to be the best, or among the best, in both the simulated environments and the real-life situations. Furthermore, GK detects firms with promising financial prospects, such as acquisition targets and firms with a “buy” recommendation, better than all other approaches.
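As an illustration of the GK idea only, here is a minimal sketch in which a simple genetic algorithm evolves candidate seed sets and the best set initializes k-means. The seed encoding (row indices of the data), the fitness criterion (within-cluster sum of squared errors), the function names ga_seeds and sse, and all GA parameters are assumptions made for this sketch, not the dissertation's actual design.

```python
# Minimal sketch, assuming seeds are encoded as row indices of X and fitness is the
# within-cluster SSE when every point is assigned to its nearest seed.
import numpy as np
from sklearn.cluster import KMeans

def sse(X, seeds):
    """Within-cluster sum of squared distances to the nearest seed."""
    d = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return d.min(axis=1).sum()

def ga_seeds(X, k, pop_size=30, generations=50, mutation_rate=0.1, seed=None):
    """Evolve sets of k distinct row indices of X; lower SSE is better."""
    rng = np.random.default_rng(seed)
    n = len(X)
    pop = [rng.choice(n, size=k, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = np.array([sse(X, X[ind]) for ind in pop])
        order = np.argsort(fitness)                              # best (lowest SSE) first
        survivors = [pop[i] for i in order[: pop_size // 2]]     # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(len(survivors), size=2, replace=False)
            mix = np.concatenate([survivors[a], survivors[b]])
            child = rng.choice(np.unique(mix), size=k, replace=False)  # crossover
            if rng.random() < mutation_rate:                           # mutation: swap one seed
                new = rng.integers(n)
                if new not in child:
                    child[rng.integers(k)] = new
            children.append(child)
        pop = survivors + children
    best = min(pop, key=lambda ind: sse(X, X[ind]))
    return X[best]

# Usage: pass the evolved seed points to k-means as its initial centers.
X = np.random.default_rng(0).normal(size=(300, 2))
seeds = ga_seeds(X, k=3, seed=0)
labels = KMeans(n_clusters=3, init=seeds, n_init=1).fit_predict(X)
```

The same two-stage pattern applies to the NK hybrid, with the screening stage replaced by a neural network; the abstract does not specify how that screening is performed, so it is not sketched here.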