A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

Roman Hornung; Christoph Bernau; Caroline Truntzer; Rory Wilson; Thomas Stadler; Anne-Laure Boulesteix


Abstract

Background: In applications of supervised statistical learning in the biomedical field, it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the entire dataset before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as “incomplete CV”. Whether incomplete CV results in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis (PCA) for dimension reduction of the covariate space. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.

Methods: We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.

Results: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Conclusions: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
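The difference between incomplete and full CV can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a one-component PCA computed by power iteration, a nearest-centroid classifier, and simulated data, all chosen only to keep the example dependency-free. The abstract does not state the formula for CVIIM; the `cviim` function assumes a relative-reduction form, 1 minus the ratio of the incomplete-CV error to the full-CV error, set to zero when incomplete CV is not more optimistic. All function names are hypothetical.

```python
import random

def fit_pca1(rows, iters=100):
    """Fit a one-component PCA by power iteration; return a projection function."""
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / p ** 0.5] * p  # deterministic starting direction
    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return lambda r: sum((r[j] - means[j]) * v[j] for j in range(p))

def nearest_centroid_predict(train_scores, train_labels, test_scores):
    """Assign each test score to the class with the closest training mean."""
    centroids = {}
    for label in sorted(set(train_labels)):
        vals = [s for s, l in zip(train_scores, train_labels) if l == label]
        centroids[label] = sum(vals) / len(vals)
    return [min(centroids, key=lambda c: abs(s - centroids[c]))
            for s in test_scores]

def cv_error(X, y, K, complete, seed=0):
    """K-fold CV misclassification rate.

    complete=True  refits the PCA on each training fold (full CV).
    complete=False fits the PCA once on all rows before CV (incomplete CV).
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    project_all = None if complete else fit_pca1(X)  # sees the test rows
    errors = 0
    for test in folds:
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        project = fit_pca1([X[i] for i in train]) if complete else project_all
        train_scores = [project(X[i]) for i in train]
        test_scores = [project(X[i]) for i in test]
        pred = nearest_centroid_predict(
            train_scores, [y[i] for i in train], test_scores)
        errors += sum(p != y[i] for p, i in zip(pred, test))
    return errors / len(X)

def cviim(e_incomplete, e_full):
    """Assumed form of the measure: relative error reduction, truncated at 0."""
    if e_full > 0 and e_incomplete < e_full:
        return 1.0 - e_incomplete / e_full
    return 0.0

# Simulated two-class data (an assumption): only the first variable is informative.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0.8 * label if j == 0 else 0.0, 1.0) for j in range(5)]
     for label in y]

e_full = cv_error(X, y, K=5, complete=True)
e_incomplete = cv_error(X, y, K=5, complete=False)
print(e_full, e_incomplete, cviim(e_incomplete, e_full))
```

Both calls use the same fold assignment, so any difference between the two error estimates comes only from where the PCA is fitted. A single small dataset gives a noisy estimate, which is consistent with the abstract's use of large dataset collections to judge a data preparation step.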


Bibliographic details

Source: BMC Medical Research Methodology, 2015, Issue 1
Original format: PDF
Classification (Chinese Library Classification): Medicine and health

