Conference: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology

ProLoc-rGO: Using rule-based knowledge with Gene Ontology terms for prediction of protein subnuclear localization

Abstract

Gene Ontology (GO) annotation is a controlled vocabulary of terms and phrases describing the function of genes and gene products, and it has been used successfully to predict subcellular and subnuclear localization. Generally, each gene product is annotated by very few GO terms out of the more than 25,000 annotations available at present. How to represent a protein sequence using GO terms as features therefore plays an important role in designing prediction systems for protein subnuclear localization. Our previous work, ProLoc-GO, can select a small number m out of a large number n of GO terms, where m ≪ n. However, its off-line training time is long, up to several days even when running on high-speed PC clusters. This study therefore proposes an efficient system, ProLoc-rGO, which uses the decision tree method to quickly mine m informative GO terms and acquire interpretable rule-based knowledge for predicting subnuclear localization. On SNL9_80 (714 proteins in nine compartments with <80% identity), ProLoc-rGO mines m = 17 informative GO terms and 17 interpretable rules, and yields training and test accuracies of 84.9% and 78.2%, respectively. For comparison, ProLoc-rGO obtains an accuracy of 82.6% (Matthews correlation coefficient (MCC) = 0.711) on SNL9_80, which is better than the 67.4% (MCC = 0.50) of Nuc-PLoc, a method that fuses the pseudo-amino acid composition of a protein with its position-specific scoring matrix.
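
The abstract does not give the details of the tree induction, so the following is only a minimal sketch of the general idea, not the ProLoc-rGO implementation. It assumes each protein is represented by the set of GO terms that annotate it, and it ranks terms by information gain, a common decision-tree split criterion that may differ from the one the authors use. The term names (`GO:A`, `GO:B`, `GO:C`), the compartment labels, and the function names are hypothetical placeholders.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(proteins, labels, term):
    """Entropy reduction from splitting on presence/absence of one GO term."""
    with_term = [y for terms, y in zip(proteins, labels) if term in terms]
    without = [y for terms, y in zip(proteins, labels) if term not in terms]
    if not with_term or not without:
        return 0.0  # the term does not split the data at all
    n = len(labels)
    remainder = (len(with_term) / n) * entropy(with_term) \
        + (len(without) / n) * entropy(without)
    return entropy(labels) - remainder


def best_term(proteins, labels):
    """Return the GO term with the highest information gain (first on ties)."""
    vocabulary = sorted(set().union(*proteins))
    return max(vocabulary, key=lambda t: information_gain(proteins, labels, t))


# Toy data: four proteins, each a set of placeholder GO terms (hypothetical).
proteins = [{"GO:A", "GO:C"}, {"GO:A"}, {"GO:B", "GO:C"}, {"GO:B"}]
labels = ["nucleolus", "nucleolus", "nuclear speckle", "nuclear speckle"]

root = best_term(proteins, labels)
print(root, information_gain(proteins, labels, root))
```

In this toy set, `GO:C` appears equally in both compartments and so carries no information, while `GO:A` separates the classes perfectly. A full decision tree would repeat this selection recursively on each branch, and each root-to-leaf path would read as one if-then rule over GO terms.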
