BMC Bioinformatics

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions



Abstract

Background

It is important to accurately determine the performance of peptide-MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was not due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either the training or the blind dataset were associated with large differences between cross-validated and blind prediction performances. We use these findings to derive quantitative rules for how large and diverse a dataset needs to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. Here we identify and quantify the specific factors contributing to this effect for MHC class I binding predictions. An increasing number of peptides whose MHC binding affinities are measured experimentally have been selected based on binding predictions, and are thus less diverse than historic datasets that sampled the entire sequence and affinity space; this makes them more difficult benchmark datasets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
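The core comparison in the study — a cross-validated performance estimate computed on the training data versus performance on an independently generated blind set — can be sketched with a toy predictor. Everything below (the one-dimensional peptide "feature", the nearest-neighbour predictor, the narrow training distribution standing in for low sequence/affinity diversity) is a hypothetical illustration, not the authors' actual method or data:

```python
import random

random.seed(0)

def true_affinity(x):
    # Hypothetical ground-truth relationship between a 1-D peptide
    # "feature" and its binding affinity (illustration only).
    return 2.0 * x + random.gauss(0, 0.1)

# Low-diversity training data: features clustered in a narrow range,
# mimicking peptides pre-selected by binding predictions.
train = [(x, true_affinity(x)) for x in (random.uniform(0.4, 0.6) for _ in range(60))]
# Blind set sampled from the full feature space, like historic datasets.
blind = [(x, true_affinity(x)) for x in (random.uniform(0.0, 1.0) for _ in range(60))]

def predict(train_pts, x):
    # 1-nearest-neighbour predictor: a stand-in for a real
    # peptide-MHC binding prediction method.
    nearest = min(train_pts, key=lambda p: abs(p[0] - x))
    return nearest[1]

def mean_abs_error(train_pts, test_pts):
    return sum(abs(predict(train_pts, x) - y) for x, y in test_pts) / len(test_pts)

def cross_validated_error(data, k=5):
    # k-fold cross-validation: iteratively hold out one fold for testing
    # and train on the remaining folds.
    folds = [data[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        held_out = folds[i]
        rest = [p for j, f in enumerate(folds) if j != i for p in f]
        errs.append(mean_abs_error(rest, held_out))
    return sum(errs) / k

cv_err = cross_validated_error(train)
blind_err = mean_abs_error(train, blind)
print(f"cross-validated error: {cv_err:.3f}")
print(f"blind-set error:       {blind_err:.3f}")
```

Because the training features cover only a narrow slice of the space the blind set samples, the cross-validated error is optimistic relative to the blind-set error — the qualitative effect the paper quantifies for real MHC-I binding data.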
