Reporting bias when using real data sets to analyze classification performance

Yousefi, Mohammadmahdi R.; Hua, Jianping; Sima, Chao; Dougherty, Edward R.

Abstract

Motivation: It is commonplace for authors to propose a new classification rule, either the operator construction part or feature selection, and demonstrate its performance on real data sets, which often come from high-dimensional studies, such as from gene-expression microarrays, with small samples. Owing to the variability in feature selection and error estimation, individual reported performances are highly imprecise. Hence, if only the best test results are reported, then these will be biased relative to the overall performance of the proposed procedure.

Results: This article characterizes reporting bias with several statistics and computes these statistics in a large simulation study using both modeled and real data. The results appear as curves giving the different reporting biases as functions of the number of samples tested when reporting only the best or second best performance. It does this for two classification rules, linear discriminant analysis (LDA) and 3-nearest-neighbor (3NN), and for filter and wrapper feature selection, t-test and sequential forward search. These were chosen on account of their well-studied properties and because they were amenable to the extremely large amount of processing required for the simulations. The results across all the experiments are consistent: there is generally large bias overriding what would be considered a significant performance differential, when reporting the best or second best performing data set. We conclude that there needs to be a database of data sets and that, for those studies depending on real data, results should be reported for all data sets in the database.
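The effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' procedure or any of their statistics. It assumes, purely for illustration, that a classifier has the same true error on every data set and that each reported error is an unbiased hold-out estimate from a small test sample. The function name `best_of_m_bias` and all parameter values are hypothetical.

```python
import random
import statistics


def best_of_m_bias(true_error=0.25, n_test=50, m=5, trials=2000, seed=0):
    """Average gap between the true error and the best of m error estimates.

    Toy assumptions (not taken from the article): every data set has the
    same true error, and each estimate is the misclassification rate on
    n_test independent held-out points.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        estimates = []
        for _ in range(m):
            # Each held-out point is misclassified with probability true_error.
            mistakes = sum(rng.random() < true_error for _ in range(n_test))
            estimates.append(mistakes / n_test)
        # Reporting only the best data set means reporting the minimum estimate.
        gaps.append(true_error - min(estimates))
    return statistics.mean(gaps)


if __name__ == "__main__":
    # Optimistic bias as a function of the number of data sets tried.
    for m in (1, 2, 5, 10, 20):
        print(m, round(best_of_m_bias(m=m), 3))
```

With a single data set the average gap is close to zero, since each estimate is unbiased on its own. The gap grows as more data sets are tried and only the best one is reported, which is the qualitative behavior the abstract attributes to selective reporting.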

Bibliographic details

Source: Bioinformatics, 2010, Issue 1, 9 pages
Original format: PDF
Language: English
Chinese Library Classification: Bioengineering (biotechnology)
Keywords: Bioinformatics; Data processing; Databases; Filters; Gene expression; Statistical analysis
