Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

Blaise Hanczar; Jianping Hua; Edward R Dougherty




Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or all features. Which of the latter two yields the better correlation shows no general trend and differs from model to model.
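
The variance decomposition mentioned in the abstract can be illustrated with a small simulation. The sketch below assumes the decomposition is the standard identity for the variance of a difference, Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), where rho is the correlation between the estimated and true errors. Everything else is an assumption made for illustration and is not taken from the paper: a two-class Gaussian model with identity covariance and one informative feature, a nearest-centroid classification rule, leave-one-out cross-validation as the estimator, no feature selection (the "all features" scenario), and the function names and parameter values themselves.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def predict(x, m0, m1):
    """Nearest-centroid rule: label 1 if x is closer to m1 than to m0."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, m0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, m1))
    return 1 if d1 < d0 else 0


def true_error(m0, m1, mu0, mu1):
    """Exact error of the designed classifier for N(mu, I) classes, equal priors."""
    w = [b - a for a, b in zip(m0, m1)]
    norm = math.sqrt(sum(v * v for v in w))
    c = sum(wi * (a + b) / 2.0 for wi, a, b in zip(w, m0, m1))
    p0 = sum(wi * mi for wi, mi in zip(w, mu0))
    p1 = sum(wi * mi for wi, mi in zip(w, mu1))
    return 0.5 * (phi((p0 - c) / norm) + phi((c - p1) / norm))


def loo_error(class0, class1):
    """Leave-one-out error estimate for the nearest-centroid rule."""
    full0, full1 = centroid(class0), centroid(class1)
    errors = 0
    for i, x in enumerate(class0):
        errors += predict(x, centroid(class0[:i] + class0[i + 1:]), full1) != 0
    for i, x in enumerate(class1):
        errors += predict(x, full0, centroid(class1[:i] + class1[i + 1:])) != 1
    return errors / (len(class0) + len(class1))


def simulate(dim, n_per_class=10, delta=1.0, trials=200, seed=0):
    """Draw many small samples; return paired (estimated, true) error lists."""
    rng = random.Random(seed)
    mu0 = [0.0] * dim
    mu1 = [delta] + [0.0] * (dim - 1)  # assumption: only feature 0 is informative
    est, tru = [], []
    for _ in range(trials):
        c0 = [[rng.gauss(m, 1.0) for m in mu0] for _ in range(n_per_class)]
        c1 = [[rng.gauss(m, 1.0) for m in mu1] for _ in range(n_per_class)]
        est.append(loo_error(c0, c1))
        tru.append(true_error(centroid(c0), centroid(c1), mu0, mu1))
    return est, tru


def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def decomposition(est, tru):
    """Terms of Var(est - true) = Var(est) + Var(true) - 2*rho*sd(est)*sd(true)."""
    var_e, var_t = cov(est, est), cov(tru, tru)
    dev = [e - t for e, t in zip(est, tru)]
    return {
        "var_dev": cov(dev, dev),
        "var_est": var_e,
        "var_true": var_t,
        "rho": cov(est, tru) / math.sqrt(var_e * var_t),
    }


if __name__ == "__main__":
    for dim in (2, 20):
        print(dim, decomposition(*simulate(dim)))
```

Running the script prints the four terms for a low-dimensional and a higher-dimensional setting, so the contribution of the correlation term can be compared with that of the estimator variance. The numbers depend entirely on the assumed model and say nothing about the paper's reported results.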


Bibliographic record

Source: EURASIP Journal on Bioinformatics and Systems Biology, 2007, issue 1, 12 pages
Authors: Blaise Hanczar; Jianping Hua; Edward R Dougherty
Original format: PDF
Chinese Library Classification: Biological sciences

Similar literature

1. Performance bounds for parameter estimates of high-dimensional linear models with correlated errors [J]. Wei-Biao Wu, Ying Nian Wu. Electronic Journal of Statistics, 2016, issue 1.
2. High-Dimensional Quadratic Classifiers in Non-sparse Settings [J]. Aoshima Makoto, Yata Kazuyoshi. Methodology and Computing in Applied Probability, 2019, issue 3.
3. Scale adjustments for classifiers in high-dimensional, low sample size settings [J]. Yao-Ban Chan, Peter Hall. Quality Control and Applied Statistics, 2010, issue 5a6.
4. Is There Correlation Between the Estimated and True Classification Errors in Small-Sample Settings? [C]. Hanczar, Blaise, Hua, . 2007.
5. Novel true stress-true strain-birefringence measurement systems for real time measurement during multiaxial deformation and heat setting of polymer films: "Application on PET films". [D]. Hassan, Mohamed K. 2004.
6. Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings [O]. Blaise Hanczar, Jianping Hua, Edward R Dougherty. 2007.
7. Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings [O]. Blaise Hanczar, Jianping Hua, Edward R. Dougherty. 2007.
