Journal: BMC Bioinformatics

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

Abstract

Background

Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. Many programs and databases exist for inferring different protein functions, and a pipeline that properly integrates these resources can benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases for predicting particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.

Results

PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases. PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision above 95.0%. CatFam is integrated with PIPA. We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it yields 40.0% more mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms.
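The abstract does not spell out the mapping algorithm, so the following is only a minimal sketch of how association rule mining over annotated sample proteins could produce ontology mappings, and why using the hierarchy can add mappings. The function names (`ec_ancestors`, `mine_mappings`), the support and confidence thresholds, and the toy protein annotations are all assumptions made for illustration and are not taken from PIPA.

```python
from collections import Counter


def ec_ancestors(ec):
    """Return an EC number and its more general forms, most specific first.

    Example: "2.7.1.1" -> ["2.7.1.1", "2.7.1.-", "2.7.-.-", "2.-.-.-"].
    """
    parts = ec.split(".")
    levels = []
    for depth in range(4, 0, -1):
        term = ".".join(parts[:depth] + ["-"] * (4 - depth))
        if term not in levels:
            levels.append(term)
    return levels


def mine_mappings(proteins, min_support=2, min_confidence=0.9, use_hierarchy=True):
    """Mine rules "source term -> target term" from co-annotated proteins.

    proteins: iterable of (source_terms, target_terms) pairs, one per protein.
    Thresholds are hypothetical defaults, not values from the paper.
    """
    source_count = Counter()
    pair_count = Counter()
    for sources, targets in proteins:
        targets = set(targets)
        if use_hierarchy:
            # Credit every ancestor of each target term as well.
            targets = {a for t in targets for a in ec_ancestors(t)}
        for s in set(sources):
            source_count[s] += 1
            for t in targets:
                pair_count[(s, t)] += 1

    # Keep rules that pass both the support and the confidence threshold.
    rules = {}
    for (s, t), n in pair_count.items():
        confidence = n / source_count[s]
        if n >= min_support and confidence >= min_confidence:
            rules[(s, t)] = confidence

    # Drop a rule when a more specific rule for the same source exists.
    specific = {}
    for (s, t), confidence in rules.items():
        shadowed = any(
            s2 == s and t2 != t and t in ec_ancestors(t2) for (s2, t2) in rules
        )
        if not shadowed:
            specific[(s, t)] = confidence
    return specific


# Toy annotations (illustrative only): COG term(s) and EC number(s) per protein.
proteins = [
    ({"COG0001"}, {"1.1.1.1"}),
    ({"COG0001"}, {"1.1.1.1"}),
    ({"COG0002"}, {"2.7.1.1"}),
    ({"COG0002"}, {"2.7.1.2"}),
]

flat = mine_mappings(proteins, use_hierarchy=False)
hierarchical = mine_mappings(proteins, use_hierarchy=True)
print(sorted(flat))
print(sorted(hierarchical))
```

In this toy data the second COG term never agrees on a full four-level EC number, so the flat run finds no rule for it, while the hierarchical run can still map it to the shared parent class `2.7.1.-`. This is one plausible way in which hierarchy-aware mining produces additional mappings.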
The mappings to EC numbers show very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%). Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, applying the consensus algorithm yields higher precision than not using it.

Conclusion

The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
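The consensus step is described only as computing and propagating likelihood scores over GO terms, so the sketch below is one possible reading rather than PIPA's actual scoring rule. It assumes that each source reports a term with a probability of being correct, that sources err independently (a noisy-OR combination), and that support for a term also counts toward its ancestors. The function names, the simplified four-term hierarchy, the scores, and the 0.5 threshold are hypothetical.

```python
from collections import defaultdict

# Simplified toy hierarchy (child -> parents); not the full GO graph.
PARENTS = {
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "hydrolase activity": ["catalytic activity"],
    "catalytic activity": [],
}


def ancestors(term, parents):
    """Collect all ancestors of a term by walking the parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def consensus(predictions, parents, threshold=0.5):
    """Combine per-source scores with noisy-OR and propagate them upward.

    predictions: list of (source, term, score), score in [0, 1].
    Returns the terms whose combined score reaches the threshold.
    """
    # miss[t] = probability that every prediction supporting t is wrong.
    miss = defaultdict(lambda: 1.0)
    for _source, term, score in predictions:
        for t in {term} | ancestors(term, parents):
            miss[t] *= 1.0 - score
    scores = {t: 1.0 - m for t, m in miss.items()}
    return {t: s for t, s in scores.items() if s >= threshold}


# Illustrative scores only; they are not measured values for these resources.
predictions = [
    ("InterPro", "kinase activity", 0.6),
    ("CDD", "transferase activity", 0.5),
    ("CatFam", "hydrolase activity", 0.3),
]

result = consensus(predictions, PARENTS)
for term, score in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.2f}")
```

Under these assumptions, two moderately confident sources that agree on a branch reinforce the shared parent term, while an isolated low-confidence prediction is filtered out. Raising or lowering the threshold trades recall against precision.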
