Conference: DoD High Performance Computing Modernization Program Users Group Conference

PIPA: A High-Throughput Pipeline for Protein Function Annotation



Abstract

Traditional experimental methods for determining the functions of proteins encoded in genomic sequences cannot keep pace with the avalanche of sequence data produced by new high-throughput sequencing technologies. This has prompted the development of numerous bioinformatics approaches for automated protein function annotation. However, these approaches frequently use different function classification terminologies, which precludes the integration of multisource predictions. We developed the Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high-performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that give the user access to and control of the pipeline's parallel execution of multiple jobs, each searching a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The integrated resources, the program for terminology conversion to GO, and the consensus annotation program constitute the main components of the core module. The third module, the preprocessing module, contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates GO mappings for the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an in-house-developed database for enzyme catalytic function prediction (CatFam) and the results of multiple integrated resources, such as the 11 member databases of InterPro and the Conserved Domains Database, into common GO terms. A Web-page-based graphical user interface has been developed based on the User Interface Toolkit. The pipeline is deployed on two Linux clusters, JVN at the Army Research Laboratory Major Shared Resource Center and JAWS at the Maui High Performance Computing Center. Currently, scientists at the Naval Medical Research Center are using PIPA to predict protein functions for newly sequenced bacterial pathogens and their near-neighbor strains. Validation tests show that, on average, the CatFam database yields predictions of enzyme catalytic functions with accuracy greater than 95%. Test results of the consensus GO annotation show an improvement in performance of up to 8% when compared with annotations in which consensus is not used.
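
The data flow described in the abstract (split the input into chunks, search each database, convert source-specific terms to GO, then form a consensus) can be illustrated with a minimal Python sketch. This is not PIPA's implementation: the abstract does not state the consensus rule, so the sketch assumes a simple vote across sources. The names `chunked`, `to_go`, `consensus`, and `TERM_TO_GO`, the term labels such as `family_1`, and the identifiers `GO:A`, `GO:B`, `GO:C` are hypothetical placeholders, not real annotations or PIPA mappings.

```python
from collections import Counter
from itertools import islice

# Hypothetical mapping from (source, source-specific term) to GO terms.
# In PIPA such mappings come from the GO-mapping generation program;
# the entries here are placeholders for illustration only.
TERM_TO_GO = {
    ("CatFam", "family_1"): {"GO:A", "GO:B"},
    ("InterPro", "entry_7"): {"GO:A"},
    ("CDD", "domain_3"): {"GO:B", "GO:C"},
}


def chunked(items, size):
    """Yield successive lists of at most `size` items.

    Each chunk could be submitted as a separate job against one database.
    """
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def to_go(source, term):
    """Convert a source-specific term to GO terms (empty set if unmapped)."""
    return TERM_TO_GO.get((source, term), set())


def consensus(predictions, min_votes=2):
    """Return GO terms supported by at least `min_votes` distinct sources.

    `predictions` is a list of (source, term) pairs for one protein.
    Assumption: a plain vote count stands in for the consensus program.
    """
    per_source = {}
    for source, term in predictions:
        per_source.setdefault(source, set()).update(to_go(source, term))
    votes = Counter(go for terms in per_source.values() for go in terms)
    return {go for go, n in votes.items() if n >= min_votes}


# Example: five protein IDs split into chunks of two.
chunks = list(chunked(["p1", "p2", "p3", "p4", "p5"], 2))

# Example: predictions for one protein from three sources.
preds = [("CatFam", "family_1"), ("InterPro", "entry_7"), ("CDD", "domain_3")]
agreed = consensus(preds)  # GO:A and GO:B each have two supporting sources
```

Converting every prediction to GO before voting is what makes results from different resources comparable; raising `min_votes` makes the sketch stricter, at the cost of dropping terms supported by only one resource.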
