Journal: BMC Bioinformatics

Bias in random forest variable importance measures: Illustrations, sources and a solution

Abstract

Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. Two mechanisms underlie this deficiency: biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other.

Conclusion: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. The suggested method can therefore be applied straightforwardly by scientists in bioinformatics research.
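The first mechanism named in the abstract, biased variable selection within a single classification tree, can be illustrated with a small simulation. The following is a minimal sketch and not the authors' code (the paper works in R): it assumes a binary response, a Gini-based split criterion, category codes treated as ordered cut points, and arbitrary illustration values for the category counts, sample size, and number of repetitions. The function names are hypothetical. All predictors are pure noise, so an unbiased selection rule would pick each of them about equally often.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of the form x <= c."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Every distinct value except the largest is a candidate cut point.
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        best = max(best, gain)
    return best


def selection_frequencies(n_categories=(2, 4, 10, 20), n_samples=120,
                          n_reps=300, seed=0):
    """Fraction of repetitions in which each uninformative predictor wins the root split.

    The category counts, sample size, and repetition count are illustrative
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    wins = [0] * len(n_categories)
    for _ in range(n_reps):
        # Response and predictors are drawn independently: no predictor is informative.
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        gains = []
        for k in n_categories:
            x = [rng.randrange(k) for _ in range(n_samples)]
            gains.append(best_gini_gain(x, y))
        wins[gains.index(max(gains))] += 1
    return [w / n_reps for w in wins]


if __name__ == "__main__":
    for k, freq in zip((2, 4, 10, 20), selection_frequencies()):
        print(f"{k:>2} categories: selected in {freq:.1%} of repetitions")
```

Because a predictor with more categories offers more candidate cut points, its best split tends to look better by chance alone, which is why the predictor with the most categories is expected to win most often here. The sketch covers only the split-selection mechanism, not the effect of bootstrap sampling with replacement.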