Journal: Pattern Recognition: The Journal of the Pattern Recognition Society

Prediction of structural classes for protein sequences and domains - Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy

Abstract

This paper addresses the computational prediction of protein structural classes. Although progress has been made in this field in recent years, the main drawback of the published prediction methods is the limited scope of their comparison procedures, which in some cases were also improperly performed. Two examples are the use of protein datasets of varying homology, which has a significant impact on prediction accuracy, and the pairwise comparison of methods using different datasets. Based on extensive experimental work, the main aim of this paper is to revisit and reevaluate the state of the art in this field. To this end, the paper performs a first-of-its-kind comprehensive, multi-goal study that investigates eight prediction algorithms, three protein sequence representations, three datasets with different homologies, and three test procedures. The quality of several previously unused prediction algorithms, a newly proposed sequence representation, and a new-to-the-field testing procedure is evaluated. Several important conclusions are drawn. First, the logistic regression classifier, which had not previously been used, is shown to perform better than the other prediction algorithms, and the high quality of the previously used support vector machines is confirmed. The results also show that the proposed sequence representation improves the accuracy of the high-quality prediction algorithms but does not improve the results of the lower-quality classifiers. The study shows that the commonly used jackknife test is computationally expensive, and the computationally less demanding 10-fold cross-validation procedure is therefore proposed. The results show no statistically significant difference between these two procedures. The experiments show that sequence homology has a very significant impact on prediction accuracy: using highly homologous datasets results in higher accuracies. Thus, the results of several past studies that used homologous datasets should not be perceived as reliable. The best prediction accuracy achieved for low-homology datasets is about 57%, which confirms the results reported by Wang and Yuan [How good is the prediction of protein structural class by the component-coupled method? Proteins 2000;38:165-175]. For a highly homologous dataset, instance-based classification is shown to be better than previously reported results: it achieved 97% prediction accuracy, demonstrating that homology is a major factor that can lead to overestimated prediction accuracy. © 2006 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
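The two test procedures compared in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the jackknife test is leave-one-out cross-validation, that a sequence is represented by its plain 20-value amino acid composition (the abstract does not define the three representations), and that the classifier is a simple 1-nearest-neighbor rule standing in for instance-based classification. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the fraction of each standard amino acid in a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_neighbor_predict(train, query):
    """Label of the training vector closest to query (squared Euclidean)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda item: dist(item[0], query))[1]


def k_fold_accuracy(samples, k):
    """Accuracy under k-fold cross-validation.

    samples is a list of (feature_vector, label) pairs. Setting
    k == len(samples) gives leave-one-out, assumed here to be the
    jackknife test; k == 10 gives 10-fold cross-validation.
    """
    n = len(samples)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    correct = 0
    for fold in range(k):
        # Every k-th sample, starting at index fold, is held out for testing.
        test = samples[fold::k]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        correct += sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / n


# Hypothetical toy data: made-up sequences with made-up class labels.
toy = [
    ("AALEELAK", "alpha"), ("LEEALAAL", "alpha"), ("EELLAAKA", "alpha"),
    ("VTVIVTYV", "beta"), ("TVIVYTVV", "beta"), ("IVTVVTIY", "beta"),
]
samples = [(composition(seq), label) for seq, label in toy]

jackknife_acc = k_fold_accuracy(samples, len(samples))  # n train/test rounds
three_fold_acc = k_fold_accuracy(samples, 3)            # only 3 rounds
```

The toy set is too small for 10 folds, so the sketch uses 3 folds. The cost argument is the same: the jackknife needs one training round per sequence, whereas k-fold cross-validation needs only k rounds regardless of dataset size.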