Proteins: Structure, Function, and Genetics

PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis.

Abstract

Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. In recent years, many computational approaches have been proposed for sequence-based PSL prediction in Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA), to solve this problem. A protein is treated as a term string composed of gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weights of the gapped-dipeptides are calculated from a position-specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can therefore be classified into a low- or high-homology data set. PSLDoc's overall accuracy on the low- and high-homology data sets reaches 86.84% and 98.21%, respectively, which compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. When the confidence threshold is set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that this feature representation for proteins can be successfully applied to PSL prediction and improves prediction accuracy. Moreover, because the representation is general, the method can be extended to eukaryotic proteomes in the future. The PSLDoc web server is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.
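To make the term representation and the thresholded decision rule concrete, the following is a minimal Python sketch and not the PSLDoc implementation. It rests on several assumptions: a gapped-dipeptide is written as `XdZ`, where `d` is the difference in sequence position between residues `X` and `Z`; terms are weighted by raw counts instead of the position-specific scoring matrix weights described in the abstract; and the PLSA and support vector machine stages are omitted. The function names, the `max_distance` parameter, and the site labels in the usage note are hypothetical.

```python
from collections import Counter


def gapped_dipeptide_counts(sequence, max_distance=3):
    """Count gapped-dipeptide terms in a protein sequence.

    A term such as 'A2K' means residue A followed by residue K two
    positions later. Raw counts are an assumption made for illustration;
    the abstract describes weights derived from a scoring matrix.
    """
    counts = Counter()
    for i, first in enumerate(sequence):
        for d in range(1, max_distance + 1):
            j = i + d
            if j >= len(sequence):
                break  # no residue exists at this distance
            counts[f"{first}{d}{sequence[j]}"] += 1
    return counts


def predict_site(site_probabilities, threshold=0.7):
    """Return the site with the highest probability.

    Returns None when the best probability is below the confidence
    threshold, i.e. the sketch declines to predict.
    """
    site, probability = max(site_probabilities.items(), key=lambda kv: kv[1])
    return site if probability >= threshold else None
```

For example, `gapped_dipeptide_counts("MKKL", max_distance=2)` yields one count each for the terms `M1K`, `M2K`, `K1K`, `K2L`, and `K1L`, and `predict_site({"cytoplasmic": 0.8, "periplasmic": 0.2})` returns `"cytoplasmic"` because 0.8 clears the 0.7 threshold. Raising the threshold trades recall for precision, which is the trade-off the abstract describes.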
