Journal: Amino Acids

Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction

Abstract

With the avalanche of newly found protein sequences emerging in the post-genomic era, it is highly desirable to develop an automated method for identifying their subcellular locations quickly and reliably, because the knowledge thus obtained can provide key clues for revealing their functions and understanding how they interact with each other in cellular networking. However, predicting the subcellular location of eukaryotic proteins is a challenging problem, particularly when unknown query proteins do not have significant homology to proteins of known subcellular locations and when more locations need to be covered. To cope with this challenge, protein samples are formulated by hybridizing the information derived from the gene ontology database and amphiphilic pseudo amino acid composition. Based on such a representation, a novel ensemble hybridization classifier was developed by fusing many basic individual classifiers through a voting system. Each of these basic classifiers was engineered by the KNN (K-Nearest Neighbor) principle. As a demonstration, a new benchmark dataset was constructed that covers the following 18 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cyanelle, (5) cytoplasm, (6) cytoskeleton, (7) endoplasmic reticulum, (8) extracell, (9) Golgi apparatus, (10) hydrogenosome, (11) lysosome, (12) mitochondria, (13) nucleus, (14) peroxisome, (15) plasma membrane, (16) plastid, (17) spindle pole body, and (18) vacuole. To avoid homology bias, none of the proteins included has ≥25% sequence identity to any other in the same subcellular location. The overall success rates thus obtained via the 5-fold and jackknife cross-validation tests were 81.6% and 80.3%, respectively, which were 40–50% higher than those achieved by the other existing methods on the same strict dataset. The powerful predictor, named “Euk-PLoc”, is available as a web server at http://202.120.37.186/bioinf/euk. Furthermore, to support the needs of people working in the relevant areas, a downloadable file will be provided at the same website listing the results predicted by Euk-PLoc for all eukaryotic protein entries (excluding fragments) in the Swiss-Prot database that do not have subcellular location annotations or are annotated as uncertain. The large-scale results will be updated twice a year to include new eukaryotic protein entries and to reflect the continuous development of Euk-PLoc.
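
The idea of fusing several KNN classifiers through a vote, and of scoring the result with a jackknife (leave-one-out) test, can be illustrated with a minimal sketch. This is not the Euk-PLoc implementation: the function names, the choice to vary only the neighbor count `k` between base classifiers, the Euclidean distance, and the two-dimensional toy feature vectors are all assumptions made for illustration. The real predictor represents proteins with gene ontology and amphiphilic pseudo amino acid composition features, which are not reproduced here.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Return the majority label among the k training samples nearest to query."""
    # Rank training samples by Euclidean distance to the query (assumed metric).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # Ties resolve in favor of the label encountered first, i.e. the nearer neighbor.
    return votes.most_common(1)[0][0]


def ensemble_predict(train_X, train_y, query, ks=(1, 3, 5)):
    """Fuse several KNN classifiers (one per k, an assumption) by majority vote."""
    votes = Counter(knn_predict(train_X, train_y, query, k) for k in ks)
    return votes.most_common(1)[0][0]


def jackknife_accuracy(X, y, ks=(1, 3, 5)):
    """Leave-one-out test: predict each sample from all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if ensemble_predict(rest_X, rest_y, X[i], ks) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy data: two well-separated clusters standing in for two locations.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]
y = ["cytoplasm"] * 4 + ["nucleus"] * 4

print(ensemble_predict(X, y, (0.1, 0.1)))
print(jackknife_accuracy(X, y))
```

In the jackknife test each sample is held out in turn and predicted from the remaining ones, so every protein is tested exactly once and the outcome does not depend on a random split, in contrast to a 5-fold test.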
