BMC Bioinformatics

A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data


Abstract

Background: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high-dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees, which are based on orthogonal splits in feature space.

Results: We propose to combine the best of both approaches, and evaluated the joint use of feature selection, based on recursive feature elimination using the Gini importance of random forests, together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, feature selection using the Gini feature importance, combined with regularized classification by discriminant partial least squares regression, performed as well as or better than filtering according to different univariate statistical tests, or than using regression coefficients in a backward feature elimination. It outperformed both the direct application of the random forest classifier and the direct application of the regularized classifiers on the full set of features.

Conclusion: The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but, on an optimal subset of features, the regularized classifiers may be preferable over the random forest classifier, in spite of their limitation to modeling linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn a double benefit of both dimensionality reduction and the elimination of noise from the classification task.
