Source: U.S. National Institutes of Health literature, Current Genomics

The Illusion of Distribution-Free Small-Sample Classification in Genomics

Abstract

Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been proposed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers for a classification rule to be applied to a small labeled data set and for the error of the resulting classifier to be estimated on the same data set, most often via cross-validation, without any assumptions being made about the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS) error, the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and an absence of a measure of error estimation accuracy is assured in small-sample settings because, even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
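To make the notion of RMS accuracy concrete, the following is a minimal simulation sketch, not taken from the paper. It assumes a hypothetical feature-label distribution (two equally likely Gaussian classes with identity covariance whose means are `delta` apart) and a simple nearest-mean classifier; the function names, dimensions, and sample sizes are illustrative choices. Because the distribution is assumed known, the true error of each designed classifier can be computed exactly and compared with its leave-one-out cross-validation estimate.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_nearest_mean(X, y):
    """Nearest-mean rule as a linear classifier: predict 1 when w @ x + b > 0."""
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)
    return m1 - m0, -0.5 * (m1 @ m1 - m0 @ m0)


def true_error(w, b, mu0, mu1):
    """Exact error of the linear classifier under the ASSUMED model:
    equal priors, class k distributed as N(mu_k, I)."""
    norm = np.linalg.norm(w)
    e0 = normal_cdf((w @ mu0 + b) / norm)    # class 0 points labelled 1
    e1 = normal_cdf(-(w @ mu1 + b) / norm)   # class 1 points labelled 0
    return 0.5 * (e0 + e1)


def loo_error(X, y):
    """Leave-one-out cross-validation estimate of the error."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = fit_nearest_mean(X[keep], y[keep])
        mistakes += int(X[i] @ w + b > 0) != y[i]
    return mistakes / n


def rms_of_loo(n, dim=2, delta=1.0, reps=500, seed=0):
    """Monte Carlo estimate of the RMS deviation between the leave-one-out
    estimate and the true error, for balanced samples of size n (n >= 4)."""
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(dim)
    mu1 = np.zeros(dim)
    mu1[0] = delta
    y = np.repeat([0, 1], n // 2)
    squared = []
    for _ in range(reps):
        means = np.where(y[:, None] == 1, mu1, mu0)
        X = rng.standard_normal((len(y), dim)) + means
        w, b = fit_nearest_mean(X, y)
        squared.append((loo_error(X, y) - true_error(w, b, mu0, mu1)) ** 2)
    return math.sqrt(np.mean(squared))


if __name__ == "__main__":
    for n in (10, 20, 60):
        print(n, round(rms_of_loo(n), 3))
```

Running the script prints the simulated RMS for a few sample sizes under this one assumed model. The point of the sketch is that the RMS can only be computed here because a distribution was specified; with real small-sample data and no distributional assumptions, the true error term is unavailable.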