EURASIP Journal on Bioinformatics and Systems Biology

Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation



Abstract

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution in addition to the variation in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance when using feature selection as opposed to using an optimal feature set without feature selection. The contribution of feature selection to the variance of the deviation distribution can be significant, contributing to over half of the variance in many of the cases studied. We consider linear discriminant analysis, 3-nearest-neighbor, and linear support vector machines for classification; sequential forward selection, sequential forward floating selection, and the t-test for feature selection; and k-fold and leave-one-out cross-validation for error estimation. We apply these to three feature-label models and patient data from a breast cancer study. In sum, the cross-validation deviation distribution is significantly flatter when there is feature selection, compared with the case when cross-validation is performed on a given feature set. This is reflected by the observed positive values of the CRIDD, which is defined to quantify the contribution of feature selection towards the deviation variance.
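To make the comparison concrete, below is a minimal Monte Carlo sketch in Python of the kind of quantity the abstract describes. It assumes the CRIDD takes the form (Var_fs - Var_opt) / Var_fs, that is, the fraction of the deviation-distribution variance attributable to feature selection; the synthetic Gaussian feature-label model, sample sizes, LDA classifier, 5-fold cross-validation, and univariate F-test selection (equivalent to the two-class t-test) are illustrative scikit-learn choices, not the paper's exact settings.

```python
# A minimal sketch of estimating deviation-distribution variance with and without
# feature selection, under the assumptions stated above (not the paper's settings).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
D, D_INFORMATIVE, K_SELECT = 20, 5, 5      # total, informative, and selected features
N_TRAIN, N_TEST, N_REPS = 50, 5000, 200    # small sample, large test set, Monte Carlo reps

def draw_sample(n):
    """Two equally likely Gaussian classes; only the first D_INFORMATIVE features carry signal."""
    y = rng.integers(0, 2, size=n)
    mu = np.zeros(D)
    mu[:D_INFORMATIVE] = 0.8
    X = rng.normal(size=(n, D)) + np.outer(y, mu)
    return X, y

def deviation(clf, X, y, X_test, y_test):
    """Cross-validation error estimate minus the (approximate) true error."""
    cv_error = 1.0 - cross_val_score(clf, X, y, cv=5).mean()
    true_error = 1.0 - clf.fit(X, y).score(X_test, y_test)
    return cv_error - true_error

X_test, y_test = draw_sample(N_TEST)       # large hold-out set approximates the true error
dev_fs, dev_opt = [], []
for _ in range(N_REPS):
    X, y = draw_sample(N_TRAIN)

    # (a) feature selection (univariate F-test) applied inside every CV fold
    clf_fs = make_pipeline(SelectKBest(f_classif, k=K_SELECT),
                           LinearDiscriminantAnalysis())
    dev_fs.append(deviation(clf_fs, X, y, X_test, y_test))

    # (b) the "optimal" feature set is known by construction: the informative features
    cols = slice(0, D_INFORMATIVE)
    dev_opt.append(deviation(LinearDiscriminantAnalysis(),
                             X[:, cols], y, X_test[:, cols], y_test))

var_fs, var_opt = np.var(dev_fs), np.var(dev_opt)
cridd = (var_fs - var_opt) / var_fs        # assumed form of the CRIDD (see note above)
print(f"deviation variance with feature selection : {var_fs:.5f}")
print(f"deviation variance, fixed optimal features: {var_opt:.5f}")
print(f"CRIDD (fraction of variance due to selection): {cridd:.2f}")
```

A positive value of the ratio printed at the end corresponds to the flattening of the cross-validation deviation distribution under feature selection that the abstract reports.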

