
The use of gene ontology evidence codes in preventing classifier assessment bias


Abstract

MOTIVATION: The biological community's reliance on computational annotations of protein function makes correct assessment of function prediction methods an issue of great importance. A large fraction of the annotations in current biological databases are based on computational methods, which can bias estimates of the accuracy of function prediction methods. The bias arises because predicting an annotation that was itself derived computationally is likely easier than predicting one that was derived experimentally, leading to over-optimistic estimates of classifier performance.

RESULTS: We illustrate this phenomenon in a set of controlled experiments using a nearest neighbor classifier that uses PSI-BLAST similarity scores. Our results demonstrate that the source of the Gene Ontology (GO) annotations used to assess a protein function predictor can have a highly significant influence on classifier accuracy: the average accuracy over four species and over GO terms in the biological process namespace increased from 0.72 to 0.87 when the classifier was given access to annotations whose evidence codes indicate a possible computational source, instead of experimentally determined annotations. Slightly smaller increases were observed in the other namespaces. In these comparisons the total number of annotations and their distribution across GO terms were kept the same.

CONCLUSION: Taking GO evidence codes into account is required for reporting accuracy statistics that do not overestimate a model's performance, and is of particular importance for a fair comparison of classifiers that rely on different information sources.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
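The effect described above can be illustrated with a minimal sketch. It is not the authors' pipeline. The abstract does not say which evidence codes were counted as experimental, so the sketch assumes the core experimental codes defined by the GO Consortium (EXP, IDA, IPI, IMP, IGI, IEP) and, as a simplification, places every other code in a single "other" bucket. The protein names, term labels, and similarity scores are invented, and a plain dictionary of scores stands in for PSI-BLAST output.

```python
from collections import defaultdict

# Assumption: the GO Consortium's core experimental evidence codes.
# The abstract does not state the exact grouping used in the paper.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def split_by_evidence(annotations):
    """Split (protein, term, evidence_code) triples into two label maps.

    Returns (experimental, other), each mapping protein -> set of terms.
    """
    experimental = defaultdict(set)
    other = defaultdict(set)
    for protein, term, code in annotations:
        bucket = experimental if code in EXPERIMENTAL_CODES else other
        bucket[protein].add(term)
    return dict(experimental), dict(other)


def nearest_neighbor_terms(query, similarity, labels):
    """Copy the terms of the labeled protein most similar to `query`.

    `similarity` maps (query, protein) pairs to a score; higher is closer.
    Returns an empty set when no labeled protein has a score.
    """
    candidates = [p for p in labels if p != query and (query, p) in similarity]
    if not candidates:
        return set()
    best = max(candidates, key=lambda p: similarity[(query, p)])
    return set(labels[best])


# Hypothetical toy data: proteins P1..P3, query Q, terms T1 and T2.
annotations = [
    ("P1", "T1", "IDA"),  # experimental
    ("P2", "T1", "IEA"),  # inferred from electronic annotation
    ("P3", "T2", "IMP"),  # experimental
    ("P3", "T1", "ISS"),  # inferred from sequence similarity
]
similarity = {("Q", "P1"): 0.2, ("Q", "P2"): 0.9, ("Q", "P3"): 0.5}

experimental, other = split_by_evidence(annotations)
pred_experimental = nearest_neighbor_terms("Q", similarity, experimental)
pred_other = nearest_neighbor_terms("Q", similarity, other)
```

In this toy case the prediction for `Q` changes with the label set: restricted to experimental annotations the nearest labeled neighbor is `P3`, while the other bucket makes the more similar `P2` available. A similarity-based classifier scored against annotations that were themselves transferred by sequence similarity can therefore look better than it would against experimental annotations.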
