Journal: BMC Medical Research Methodology

Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction


Abstract

Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling them, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how well they perform for non-normally distributed data or when there are non-linear relationships or interactions. To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performance of the RF-based imputation methods missForest and CALIBERrfimpute was evaluated in comparison with predictive mean matching (PMM). Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverage, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction. RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
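To make the term "outcome-dependent MAR covariate" concrete, the following is a minimal Python sketch of how such missingness might be generated for a skewed covariate. It is not the simulation design used in the study: the function name, the log-normal covariate, the linear outcome model, the logistic missingness rule, and all parameter values are illustrative assumptions, and no imputation method is applied.

```python
import math
import random


def simulate_outcome_dependent_mar(n=5000, beta0=1.0, beta1=0.5, seed=0):
    """Simulate (x, y) pairs, then delete x with a probability that depends only on y.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.lognormvariate(0.0, 1.0)              # highly skewed covariate
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)   # assumed linear outcome model
        # Missingness in x depends on the fully observed outcome y, not on x
        # itself given y, so the mechanism is MAR and outcome-dependent.
        p_missing = 1.0 / (1.0 + math.exp(-(y - 2.0)))
        x_obs = None if rng.random() < p_missing else x
        rows.append((x, y, x_obs))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rows = simulate_outcome_dependent_mar()
full_mean = mean(x for x, _, _ in rows)
observed_mean = mean(x_obs for _, _, x_obs in rows if x_obs is not None)
missing_fraction = mean(1.0 if x_obs is None else 0.0 for _, _, x_obs in rows)

print(f"fraction of x missing:        {missing_fraction:.2f}")
print(f"mean of x (full data):        {full_mean:.2f}")
print(f"mean of x (observed x only):  {observed_mean:.2f}")
```

Because larger outcomes make the covariate more likely to be missing in this sketch, the observed values of x under-represent its upper tail, which illustrates why analyses that ignore the missing data mechanism can be biased.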
