Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction

Shangzhi Hong; Henry S. Lynn

Abstract

Missing data are common in statistical analyses, and imputation methods based on random forests (RF) are becoming popular for handling missing data, especially in biomedical research. Unlike standard imputation approaches, RF-based imputation methods do not assume normality or require specification of parametric models. However, it remains unclear how they perform for non-normally distributed data or when there are non-linear relationships or interactions. To examine the effects of these three factors, a variety of datasets were simulated with outcome-dependent missing at random (MAR) covariates, and the performances of the RF-based imputation methods missForest and CALIBERrfimpute were evaluated in comparison with predictive mean matching (PMM). Both missForest and CALIBERrfimpute have high predictive accuracy, but missForest can produce severely biased regression coefficient estimates and downward-biased confidence interval coverages, especially for highly skewed variables in nonlinear models. CALIBERrfimpute typically outperforms missForest when estimating regression coefficients, although its biases are still substantial and can be worse than those of PMM for logistic regression relationships with interaction. RF-based imputation, in particular missForest, should not be indiscriminately recommended as a panacea for imputing missing data, especially when data are highly skewed and/or outcome-dependent MAR. A correct analysis requires a careful critique of the missing data mechanism and the inter-relationships between the variables in the data.
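
The simulation setting described in the abstract, a skewed covariate that goes missing with a probability depending on the observed outcome, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the log-normal covariate, the linear outcome model, the coefficient values, the logistic missingness function, and the helper names `simulate_outcome_dependent_mar` and `ols_slope`. The sketch only generates the data and compares the slope from the full data with the slope from complete cases; it does not implement missForest, CALIBERrfimpute, or PMM.

```python
import math
import random


def simulate_outcome_dependent_mar(n=2000, beta0=1.0, beta1=0.5, seed=0):
    """Return rows (x_observed, y, x_true); x_observed is None when masked.

    All distributions and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.lognormvariate(0.0, 1.0)  # highly right-skewed covariate
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)  # linear outcome model
        # Missingness of x depends only on the fully observed outcome y,
        # so the covariate is outcome-dependent MAR.
        p_missing = 1.0 / (1.0 + math.exp(-(y - 2.0)))
        x_obs = None if rng.random() < p_missing else x
        rows.append((x_obs, y, x))
    return rows


def ols_slope(pairs):
    """Ordinary least squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


rows = simulate_outcome_dependent_mar()
full_data = [(x_true, y) for _, y, x_true in rows]
complete_cases = [(x_obs, y) for x_obs, y, _ in rows if x_obs is not None]

print("fraction of x missing:", 1 - len(complete_cases) / len(rows))
print("slope, full data:", ols_slope(full_data))
print("slope, complete cases only:", ols_slope(complete_cases))
```

Comparing the two printed slopes gives a rough sense of how much the missingness mechanism alone can distort a regression coefficient. An imputation method would be judged the same way: by how close the coefficient estimated from the imputed data comes to the value used to generate it.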


Bibliographic details

Source: BMC Medical Research Methodology, 2020, Issue 1, 12 pages
Authors: Shangzhi Hong; Henry S. Lynn
Original format: PDF
Keywords: Missing data imputation; Imputation accuracy; Random forest
