Journal: Bioinformatics

Reporting bias when using real data sets to analyze classification performance

Abstract

Motivation: It is commonplace for authors to propose a new classification rule, either the operator construction part or feature selection, and demonstrate its performance on real data sets, which often come from high-dimensional studies with small samples, such as gene-expression microarray studies. Owing to the variability in feature selection and error estimation, individual reported performances are highly imprecise. Hence, if only the best test results are reported, they will be biased relative to the overall performance of the proposed procedure.

Results: This article characterizes reporting bias with several statistics and computes these statistics in a large simulation study using both modeled and real data. The results appear as curves giving the different reporting biases as functions of the number of samples tested when reporting only the best or second best performance. The study covers two classification rules, linear discriminant analysis (LDA) and 3-nearest-neighbor (3NN), and two feature-selection methods, one filter (t-test) and one wrapper (sequential forward search). These were chosen on account of their well-studied properties and because they were amenable to the extremely large amount of processing required for the simulations. The results across all the experiments are consistent: when only the best or second best performing data set is reported, the bias is generally large enough to override what would be considered a significant performance differential. We conclude that there needs to be a database of data sets and that, for those studies depending on real data, results should be reported for all data sets in the database.
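The effect described in the abstract can be illustrated with a toy Monte Carlo sketch. This is not the authors' simulation: it assumes every data set has the same true error rate and that each reported error is an unbiased holdout estimate on a small test set, with no LDA, 3NN or feature selection involved. The function names and the parameters `true_error`, `n_test` and `n_trials` are hypothetical, chosen only for illustration.

```python
import random
from statistics import mean


def estimated_error(rng, true_error, n_test):
    """Holdout estimate: fraction of n_test points misclassified.

    Assumption: each test point is misclassified independently with
    probability true_error, so the estimate is unbiased but noisy.
    """
    mistakes = sum(rng.random() < true_error for _ in range(n_test))
    return mistakes / n_test


def best_of_m_bias(m, true_error=0.25, n_test=50, n_trials=2000, seed=0):
    """Average of (best reported error - true error) over many repetitions.

    In each repetition the rule is evaluated on m data sets and only the
    smallest estimated error is kept, mimicking best-result reporting.
    """
    rng = random.Random(seed)
    best = [
        min(estimated_error(rng, true_error, n_test) for _ in range(m))
        for _ in range(n_trials)
    ]
    return mean(best) - true_error


if __name__ == "__main__":
    # Bias of the reported error as a function of the number of data sets tried.
    for m in (1, 2, 5, 10, 20):
        print(m, round(best_of_m_bias(m), 3))
```

Under these assumptions the bias is close to zero when a single data set is reported and becomes increasingly optimistic (more negative) as more data sets are tried and only the best is kept, which is the qualitative shape of the curves the abstract describes.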
