Genomics, Proteomics & Bioinformatics (English edition)

Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

Abstract

Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use machine learning and exploratory data analysis (EDA) techniques to examine and characterize the amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 human protein sequences was extracted from the SWISS-PROT database, and feature vectors were created to capture specific amino acid sequence characteristics. Compared with a Support Vector Machine, a Multi-layer Perceptron, and a Naive Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in predicting the subcellular localization of proteins from their amino acid sequences (average precision = 0.88; average sensitivity = 0.86). Furthermore, EDA graphics characterized essential features of the proteins in each compartment. For example, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These results show that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
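To make the kind of feature and metric mentioned in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the proportions of hydrophobic, neutral, and polar residues in a sequence, and the precision and sensitivity for one compartment. The three-way residue grouping (`GROUPS`), the function names, and the toy labels are all assumptions made for illustration; the abstract does not specify the grouping or the feature set actually used.

```python
from collections import Counter

# Assumed three-class residue grouping (a commonly used hydrophobicity
# scheme; NOT taken from the paper, which does not list its groups here).
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def composition_features(sequence):
    """Return the proportion of residues falling in each assumed group.

    Characters outside the 20 standard amino acids are ignored.
    """
    counts = Counter(sequence.upper())
    group_counts = {
        name: sum(counts[aa] for aa in members)
        for name, members in GROUPS.items()
    }
    total = sum(group_counts.values())
    if total == 0:
        return {name: 0.0 for name in GROUPS}
    return {name: n / total for name, n in group_counts.items()}


def precision_and_sensitivity(true_labels, predicted_labels, compartment):
    """Compute precision and sensitivity (recall) for one compartment."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == compartment and p == compartment)
    fp = sum(1 for t, p in pairs if t != compartment and p == compartment)
    fn = sum(1 for t, p in pairs if t == compartment and p != compartment)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    # Toy sequence and toy labels, invented purely for illustration.
    print(composition_features("LLVVKKGG"))
    truth = ["mito", "mito", "cyto", "cyto"]
    preds = ["mito", "cyto", "cyto", "cyto"]
    print(precision_and_sensitivity(truth, preds, "mito"))
```

Averaging the per-compartment values over all nine compartments would give summary figures of the kind reported above, although the abstract does not state how its averages were weighted.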
