
Is cross-validation better than resubstitution for ranking genes?


Abstract

Motivation: Ranking gene feature sets is a key issue both for phenotype classification (for instance, tumor classification in a DNA microarray experiment) and for prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data and applies this classifier in turn to each observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier from the remaining observations, and then checks whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.

Results: A model-based approach is used to compare the ranking performance of resubstitution and cross-validation, both for classification based on real-valued feature sets and for prediction in the context of probabilistic Boolean networks (PBNs). For classification, a Gaussian model is considered, along with classification via linear discriminant analysis and the 3-nearest-neighbor classification rule. Prediction is examined in the steady-state distribution of a PBN. Three metrics are proposed to compare feature-set ranking based on error estimation with ranking based on the true error, which is known owing to the model-based approach. In all cases, resubstitution is competitive with cross-validation with respect to ranking accuracy. This is in addition to the enormous savings in computation time afforded by resubstitution.
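The two estimators described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code or experimental setup: the two-class Gaussian model with unit variance, the per-feature mean shifts, the sample size, the candidate feature sets, and all function names (`knn_predict`, `resubstitution_error`, `leave_one_out_error`, `rank_feature_sets`) are assumptions chosen for illustration. It uses the 3-nearest-neighbor rule mentioned in the abstract and ranks feature sets by each error estimate.

```python
import random


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query))
    )[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def resubstitution_error(data, k=3):
    """Design the classifier on all data, then test on the same observations."""
    wrong = sum(knn_predict(data, x, k) != y for x, y in data)
    return wrong / len(data)


def leave_one_out_error(data, k=3):
    """Delete each observation in turn and classify it from the remainder."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        wrong += knn_predict(rest, x, k) != y
    return wrong / len(data)


def sample_gaussian(n_per_class, shifts, rng):
    """Assumed model: class 0 has mean 0, class 1 has mean shifts[j] on feature j."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = tuple(rng.gauss(label * s, 1.0) for s in shifts)
            data.append((x, label))
    return data


def rank_feature_sets(data, feature_sets, estimator):
    """Order candidate feature sets from lowest to highest estimated error."""
    def project(fs):
        return [(tuple(x[j] for j in fs), y) for x, y in data]
    return sorted(feature_sets, key=lambda fs: estimator(project(fs)))


rng = random.Random(0)
# Illustrative shifts: feature 0 is informative, feature 2 is pure noise.
data = sample_gaussian(n_per_class=15, shifts=(1.5, 0.75, 0.0), rng=rng)
feature_sets = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]

resub = resubstitution_error(data)
loo = leave_one_out_error(data)
ranking_resub = rank_feature_sets(data, feature_sets, resubstitution_error)
ranking_loo = rank_feature_sets(data, feature_sets, leave_one_out_error)

print(f"resubstitution error: {resub:.3f}, leave-one-out error: {loo:.3f}")
print("ranking by resubstitution:", ranking_resub)
print("ranking by leave-one-out: ", ranking_loo)
```

For the 3-nearest-neighbor rule with distinct points, the resubstitution estimate can never exceed the leave-one-out estimate, because each point votes for itself under resubstitution. The cost difference is also visible: leave-one-out redesigns the classifier once per observation, whereas resubstitution designs it once. Comparing either ranking with the ranking by true error, as the paper does, would additionally require the true error under the model, which this sketch does not compute.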
