Journal: Pattern Recognition: The Journal of the Pattern Recognition Society

Gene selection with guided regularized random forest

Abstract

The regularized random forest (RRF) was recently proposed for feature selection by building only one ensemble. In RRF the features are evaluated on a part of the training data at each tree node. We derive an upper bound for the number of distinct Gini information gain values in a node, and show that many features can share the same information gain at a node with a small number of instances and a large number of features. Therefore, in a node with a small number of instances, RRF is likely to select a feature that is not strongly relevant. Here an enhanced RRF, referred to as the guided RRF (GRRF), is proposed. In GRRF, the importance scores from an ordinary random forest (RF) are used to guide the feature selection process in RRF. Experiments on 10 gene data sets show that the accuracy of GRRF is, in general, more robust than that of RRF to changes in their parameters. GRRF is computationally efficient, can select compact feature subsets, and achieves accuracy competitive with RRF, varSelRF and LASSO logistic regression (with evaluations from an RF classifier). Also, RF applied to the features selected by RRF with minimal regularization outperforms RF applied to all the features for most of the data sets considered here. Therefore, if accuracy is considered more important than the size of the feature subset, RRF with minimal regularization may be considered. We use the accuracy of RF, a strong classifier, to evaluate feature selection methods, and illustrate that weak classifiers are less capable of capturing the information contained in a feature subset. Both RRF and GRRF are implemented in the "RRF" R package available at CRAN, the official R package archive.
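
The tie problem and the guiding idea described in the abstract can be illustrated on a single small node. The following is a minimal Python sketch, not the implementation in the "RRF" package: the function names, the toy data, the importance scores, and the parameters `gamma` and `lambda0` are all hypothetical. It also assumes that the guidance takes the form of a per-feature coefficient `(1 - gamma) * lambda0 + gamma * normalized_importance` multiplied into the gain of features that have not yet been selected; the abstract itself does not state the formula.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - float(np.sum(p ** 2))

def best_gini_gain(x, y):
    """Largest Gini gain over all threshold splits of one feature."""
    parent = gini(y)
    best = 0.0
    for t in np.unique(x)[:-1]:  # every threshold that leaves both children non-empty
        left, right = y[x <= t], y[x > t]
        child = (left.size * gini(left) + right.size * gini(right)) / y.size
        best = max(best, parent - child)
    return best

def guided_coefficients(importance, gamma, lambda0=1.0):
    """Assumed penalty form: weak-importance features get a smaller coefficient."""
    imp = np.asarray(importance, dtype=float)
    return (1.0 - gamma) * lambda0 + gamma * imp / imp.max()

def pick_feature(gains, coef, selected):
    """Penalize gains of features not yet selected, then take the best one."""
    penalized = np.where(selected, gains, coef * gains)
    return int(np.argmax(penalized))

# Toy node: 4 instances, 2 classes, 3 features (hypothetical data).
y = np.array([0, 0, 1, 1])
X = np.array([
    [1.0, 0.1, 1.0],
    [2.0, 0.3, 3.0],
    [3.0, 0.7, 2.0],
    [4.0, 0.9, 4.0],
])
gains = np.array([best_gini_gain(X[:, j], y) for j in range(X.shape[1])])

# Hypothetical importance scores from an ordinary RF trained on all the data.
rf_importance = np.array([0.2, 1.0, 0.1])
selected = np.zeros(X.shape[1], dtype=bool)  # no feature used in earlier nodes yet

# gamma = 0 gives every feature the same coefficient (no guidance).
unguided = pick_feature(gains, guided_coefficients(rf_importance, gamma=0.0), selected)
guided = pick_feature(gains, guided_coefficients(rf_importance, gamma=0.5), selected)
print("gains:", gains)
print("unguided pick:", unguided, "guided pick:", guided)
```

In this toy node the first two features separate the classes equally well, so their Gini gains tie and the unguided choice falls back on feature order. With the assumed coefficient, the feature with the higher forest-level importance wins the tie.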
