Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology

Abstract

Random forest (RF) modeling has emerged as an important statistical learning method in ecology due to its exceptional predictive performance. However, for large and complex ecological data sets there is limited guidance on variable selection methods for RF modeling. Typically, either a preselected set of predictor variables is used, or stepwise procedures are employed that iteratively remove variables according to their importance measures. This paper investigates the application of variable selection methods to RF models for predicting probable biological stream condition. Our motivating data set consists of the good/poor condition of n = 1365 stream survey sites from the 2008/2009 National Rivers and Streams Assessment, and a large set (p = 212) of landscape features from the StreamCat data set as potential predictors. We compare two types of RF models: a full variable set model with all 212 predictors, and a reduced variable set model selected using a backwards elimination approach. We assess model accuracy using RF’s internal out-of-bag estimate, and a cross-validation procedure with validation folds external to the variable selection process. We also assess the stability of the spatial predictions generated by the RF models to changes in the number of predictors, and argue that model selection needs to consider both accuracy and stability. The results suggest that RF modeling is robust to the inclusion of many variables of moderate to low importance. We found no substantial improvement in cross-validated accuracy as a result of variable reduction. Moreover, the backwards elimination procedure tended to select too few variables, and exhibited numerous issues such as upwardly biased out-of-bag accuracy estimates and instabilities in the spatial predictions. We use simulations to further support and generalize results from the analysis of real data. A main purpose of this work is to elucidate issues of model selection bias and instability to ecologists interested in using RF to develop predictive models with large environmental data sets.
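
The selection bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's analysis: it uses pure-noise predictors, a nearest-centroid classifier as a simple stand-in for RF, and a one-shot "keep the top k variables" filter in place of backwards elimination. All function names and parameter values are hypothetical. It compares accuracy when variables are chosen once on the full data (the validation folds have already influenced the selection) with accuracy when selection is repeated inside each training fold (the folds stay external to selection).

```python
import random

def top_k_features(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def gap(j):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_predict(X, y, train, feats, i):
    """Nearest-centroid prediction for row i (assumed stand-in for RF)."""
    dist = {}
    for label in (0, 1):
        members = [r for r in train if y[r] == label]
        dist[label] = sum(
            (X[i][j] - sum(X[r][j] for r in members) / len(members)) ** 2
            for j in feats
        )
    return 1 if dist[1] < dist[0] else 0

def cv_accuracy(X, y, k, n_folds, select_inside):
    """Cross-validated accuracy with selection inside or outside the folds."""
    n = len(y)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    feats_all = top_k_features(X, y, all_rows, k)  # has seen every row
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # External validation: redo the selection on the training rows only.
        feats = top_k_features(X, y, train, k) if select_inside else feats_all
        correct += sum(
            centroid_predict(X, y, train, feats, i) == y[i] for i in test
        )
    return correct / n

def simulate(seed=0, n=40, p=500, k=5, n_folds=5):
    """Pure-noise data: labels are unrelated to every predictor."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced labels, independent of X
    biased = cv_accuracy(X, y, k, n_folds, select_inside=False)
    honest = cv_accuracy(X, y, k, n_folds, select_inside=True)
    return biased, honest

if __name__ == "__main__":
    biased, honest = simulate()
    print(f"selection outside folds: {biased:.2f}")
    print(f"selection inside folds:  {honest:.2f}")
```

Because the labels carry no signal here, the true accuracy of any model is about 0.5. An estimate well above that in the first case would come purely from letting the held-out rows take part in variable selection, which is the kind of optimism the external cross-validation in the abstract is meant to avoid.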