American Journal of Epidemiology

Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.



Abstract

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
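
The idea behind the second simulation study can be illustrated with a toy example. The sketch below is not the authors' algorithm or data: it uses a single partially observed variable, compares only conditional-mean predictions (no chained equations and no multiple imputation draws), and stands in for random forest with a small ensemble of bootstrapped regression trees written in plain Python. The quadratic dependence of `x2` on `x1`, the logistic missingness model, the sample size, and all function names are assumptions made for illustration.

```python
import math
import random


def simulate(n, rng):
    """Fully observed x1; x2 depends on x1 in a nonlinear (quadratic) way."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [a * a + rng.gauss(0.0, 0.5) for a in x1]
    # Missing at random: the chance that x2 is missing depends only on x1.
    missing = [rng.random() < 1.0 / (1.0 + math.exp(-a)) for a in x1]
    return x1, x2, missing


def fit_linear(xs, ys):
    """Least-squares line y = intercept + slope * x (the parametric model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def grow_tree(pairs, min_leaf=10):
    """Regression tree on one predictor; pairs must be sorted by x.

    Returns a float (leaf mean) or a tuple (threshold, left, right).
    """
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_score, best_i, left_sum = -1.0, None, 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if i < min_leaf or n - i < min_leaf or pairs[i - 1][0] == pairs[i][0]:
            continue
        # Maximising this is equivalent to minimising the within-node
        # sum of squared errors after the split.
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if score > best_score:
            best_score, best_i = score, i
    if best_i is None:
        return total / n
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2.0
    return (threshold,
            grow_tree(pairs[:best_i], min_leaf),
            grow_tree(pairs[best_i:], min_leaf))


def predict_tree(node, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x <= threshold else right
    return node


def fit_forest(xs, ys, rng, n_trees=25):
    """Average of trees, each grown on a bootstrap sample of the observed data."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(grow_tree(sorted((xs[i], ys[i]) for i in idx)))
    return lambda x: sum(predict_tree(t, x) for t in trees) / n_trees


def compare(seed=0, n=2000):
    """Mean prediction error on the artificially missing x2 values."""
    rng = random.Random(seed)
    x1, x2, missing = simulate(n, rng)
    obs_x = [a for a, m in zip(x1, missing) if not m]
    obs_y = [b for b, m in zip(x2, missing) if not m]
    mis_x = [a for a, m in zip(x1, missing) if m]
    mis_y = [b for b, m in zip(x2, missing) if m]

    def bias(predict):
        return sum(predict(a) - b for a, b in zip(mis_x, mis_y)) / len(mis_x)

    return bias(fit_linear(obs_x, obs_y)), bias(fit_forest(obs_x, obs_y, rng))


if __name__ == "__main__":
    bias_linear, bias_forest = compare(seed=1)
    print(f"linear model mean error: {bias_linear:.3f}")
    print(f"tree ensemble mean error: {bias_forest:.3f}")
```

Because the true values of the artificially removed entries are known in a simulation, the mean error of each model's predictions can be measured directly. In this setup the straight-line model is misspecified, so its error on the missing cases should be larger in magnitude than that of the tree ensemble, which needs no functional form to be specified.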
