Source: Molecules (U.S. National Institutes of Health literature collection)

Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features

Abstract

The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation, and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which makes it possible to predict yeast protein subcellular localization computationally. However, widely used protein sequence representations, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), struggle to capture adequate information about the interactions between residues and the positional distribution of each residue. Novel sequence representations are therefore still needed. In this study, we present two novel protein sequence representations: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI), which reflects local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, whereas other popular methods such as PseAAC and dipeptide composition use features with hundreds of dimensions. In practice, this feature representation is highly efficient in predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 for the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view feature-based method that combines the two features above with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This model achieves prediction accuracies of 0.927 and 0.871 for the CL317 and ZW225 datasets, respectively, outperforming other existing methods in jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations using evidence from published articles in authoritative journals and books.
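
The abstract contrasts its compact vector with composition-based features and reports results from jackknife tests. As a rough illustration only, the Python sketch below computes two of the well-known views it names, amino acid composition (20 dimensions) and dipeptide composition (400 dimensions), concatenates them, and runs a jackknife (leave-one-out) evaluation. GCGR, NSI, and PseAAC are not reproduced because the abstract does not define them. The 1-nearest-neighbour predictor is a stand-in, not the support vector machine used in the study. All function names and the toy sequences are hypothetical, and sequences are assumed to contain at least two residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the 20-dimensional vector of residue frequencies."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the 400-dimensional vector of adjacent-pair frequencies."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]


def fuse(seq):
    """Concatenate the two views into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def jackknife_accuracy(features, labels):
    """Leave-one-out test: hold out each sample, predict it from the rest.

    The predictor is a 1-nearest-neighbour rule, used only as a stand-in.
    """
    correct = 0
    for i, x in enumerate(features):
        rest = [j for j in range(len(features)) if j != i]
        nearest = min(rest, key=lambda j: squared_distance(x, features[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)


# Made-up toy sequences and labels, for illustration only (not real data).
toy = [
    ("AAAALLLLAAALLL", "membrane"),
    ("LLLAAALLLLAAAA", "membrane"),
    ("KKKRRRKKRRKKRR", "nuclear"),
    ("RRKKRRRKKKRRKK", "nuclear"),
]
X = [fuse(s) for s, _ in toy]
y = [label for _, label in toy]
print(len(X[0]), jackknife_accuracy(X, y))
```

Simple concatenation is only one way to combine views; the abstract does not say how the study weights or normalizes its features, so no such step is shown here.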
