Journal: EURASIP Journal on Bioinformatics and Systems Biology

Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings

Abstract

The aim of many microarray experiments is to build discriminatory diagnosis and prognosis models. Given the huge number of features and the small number of examples, model validity, which refers to the precision of error estimation, is a critical issue. Previous studies have addressed this issue via the deviation distribution (estimated error minus true error), in particular the deterioration of cross-validation precision in high-dimensional settings where feature selection is used to mitigate the peaking phenomenon (overfitting). Because classifier design is based upon random samples, both the true and estimated errors are sample-dependent random variables, and one would expect a loss of precision if the estimated and true errors are not well correlated. Natural questions therefore arise as to the degree of correlation and the manner in which lack of correlation impacts error estimation. We demonstrate the effect of correlation on error precision via a decomposition of the variance of the deviation distribution, observe that the correlation is often severely decreased in high-dimensional settings, and show that the effect of high dimensionality on error estimation tends to result more from its decorrelating effects than from its impact on the variance of the estimated error. We consider the correlation between the true and estimated errors under different experimental conditions using both synthetic and real data, several feature-selection methods, different classification rules, and three commonly used error estimators (leave-one-out cross-validation, k-fold cross-validation, and .632 bootstrap). Moreover, three scenarios are considered: (1) feature selection, (2) known feature set, and (3) all features. Only the first is of practical interest; however, the other two are needed for comparison purposes. We will observe that the true and estimated errors tend to be much more correlated in the case of a known feature set than with either feature selection or using all features, with the better correlation between the latter two showing no general trend, but differing for different models.
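
The variance decomposition mentioned in the abstract is presumably the standard identity Var(est - true) = Var(est) + Var(true) - 2 * rho * sd(est) * sd(true), where rho is the correlation between the estimated and true errors; a small rho therefore inflates the deviation variance even when Var(est) is unchanged. The following is a minimal sketch of that identity on simulated data, not the authors' experimental setup. The nearest-centroid rule, the two-class Gaussian model with a single informative feature, the use of a hold-out set to approximate the true error, and all function names and parameter values are assumptions made for illustration. It corresponds loosely to the "all features" scenario with leave-one-out cross-validation.

```python
import random
import statistics


def decompose(estimated, true):
    """Split Var(estimated - true) into variance and correlation terms."""
    n = len(estimated)
    mean_e = statistics.fmean(estimated)
    mean_t = statistics.fmean(true)
    var_e = statistics.pvariance(estimated)
    var_t = statistics.pvariance(true)
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, true)) / n
    var_dev = statistics.pvariance([e - t for e, t in zip(estimated, true)])
    denom = (var_e * var_t) ** 0.5
    rho = cov / denom if denom > 0 else 0.0
    return {"var_dev": var_dev, "var_est": var_e, "var_true": var_t, "rho": rho}


def draw(rng, n_per_class, dim, delta):
    """Two unit-variance Gaussian classes; only feature 0 separates them."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += label * delta
            data.append((x, label))
    return data


def fit_centroids(data):
    """Nearest-centroid rule (an illustrative choice): one mean vector per class."""
    dim = len(data[0][0])
    cents = {}
    for label in (0, 1):
        pts = [x for x, y in data if y == label]
        cents[label] = [sum(p[j] for p in pts) / len(pts) for j in range(dim)]
    return cents


def predict(cents, x):
    dist = {c: sum((a - b) ** 2 for a, b in zip(x, m)) for c, m in cents.items()}
    return min(dist, key=dist.get)


def error_rate(cents, data):
    return sum(predict(cents, x) != y for x, y in data) / len(data)


def loo_error(data):
    """Leave-one-out cross-validation: refit without each point, then test on it."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        cents = fit_centroids(data[:i] + data[i + 1:])
        wrong += predict(cents, x) != y
    return wrong / len(data)


def simulate(dim, trials=60, n_per_class=10, delta=1.5, n_test=150, seed=0):
    """Collect paired (estimated, true) errors over repeated small samples."""
    rng = random.Random(seed)
    est, true = [], []
    for _ in range(trials):
        train = draw(rng, n_per_class, dim, delta)
        test = draw(rng, n_test, dim, delta)  # hold-out set approximates the true error
        est.append(loo_error(train))
        true.append(error_rate(fit_centroids(train), test))
    return decompose(est, true)


if __name__ == "__main__":
    for dim in (1, 20):
        print(dim, simulate(dim))
```

Running the script prints the four terms for a low-dimensional and a higher-dimensional configuration, so the contribution of rho to the deviation variance can be compared against the contribution of Var(est). The numbers depend on the assumed model and seed and should not be read as reproducing the paper's results.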