
Intra-feature Random Forest Clustering

Abstract

Clustering algorithms are commonly used to find structure in data without being told explicitly what to look for. One key desideratum of a clustering algorithm is that the clusters it identifies from some set of features should generalize well to features that have not been measured. Yeung et al. (2001) introduce a Figure of Merit closely aligned with this desideratum, which they use to evaluate clustering algorithms. Broadly, the Figure of Merit measures the within-cluster variance of features of the data that were withheld from the clustering algorithm. Using this metric, Yeung et al. found no clustering algorithm that reliably outperformed k-means on a suite of real-world datasets (Yeung et al. 2001). This paper presents a novel clustering algorithm, intra-feature random forest clustering (IRFC), that does outperform k-means on a variety of real-world datasets under this metric. IRFC begins by training an ensemble of decision trees of limited depth to predict randomly selected features given the remaining features. It then aggregates the partitions implied by these trees and outputs the number of clusters specified by an input parameter.
