BMC Bioinformatics

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions



Abstract

Background

It is important to accurately determine the performance of peptide-MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was not due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either the training or the blind dataset were associated with large differences between cross-validated and blind prediction performances. We use these findings to derive quantitative rules for how large and diverse a dataset needs to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. Here we identify and quantify the specific factors contributing to this effect for MHC class I binding predictions. An increasing number of peptides whose MHC binding affinities are measured experimentally have been selected based on binding predictions, and are thus less diverse than historic datasets that sampled the entire sequence and affinity space; this makes them more difficult benchmark datasets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
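The core comparison in the study — a cross-validated performance estimate computed on the training data versus performance on an independently generated blind set — can be sketched with a toy predictor. Everything below (the one-dimensional peptide "feature", the nearest-neighbour predictor, the narrow training distribution standing in for low sequence/affinity diversity) is a hypothetical illustration, not the authors' actual method or data:

```python
import random

random.seed(0)

def true_affinity(x):
    # Hypothetical ground-truth relationship between a 1-D peptide
    # "feature" and its binding affinity (illustration only).
    return 2.0 * x + random.gauss(0, 0.1)

# Low-diversity training data: features clustered in a narrow range,
# mimicking peptides pre-selected by binding predictions.
train = [(x, true_affinity(x)) for x in (random.uniform(0.4, 0.6) for _ in range(60))]
# Blind set sampled from the full feature space, like historic datasets.
blind = [(x, true_affinity(x)) for x in (random.uniform(0.0, 1.0) for _ in range(60))]

def predict(train_pts, x):
    # 1-nearest-neighbour predictor: a stand-in for a real
    # peptide-MHC binding prediction method.
    nearest = min(train_pts, key=lambda p: abs(p[0] - x))
    return nearest[1]

def mean_abs_error(train_pts, test_pts):
    return sum(abs(predict(train_pts, x) - y) for x, y in test_pts) / len(test_pts)

def cross_validated_error(data, k=5):
    # k-fold cross-validation: iteratively hold out one fold for testing
    # and train on the remaining folds.
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        held_out = folds[i]
        rest = [p for j, f in enumerate(folds) if j != i for p in f]
        errs.append(mean_abs_error(rest, held_out))
    return sum(errs) / k

cv_err = cross_validated_error(train)
blind_err = mean_abs_error(train, blind)
print(f"cross-validated error: {cv_err:.3f}")
print(f"blind-set error:       {blind_err:.3f}")
```

Because the training features cover only a narrow slice of the space the blind set samples, the cross-validated error is optimistic relative to the blind-set error — the qualitative effect the paper quantifies for real MHC-I binding data.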
