Journal: Genomics, Proteomics & Bioinformatics (English Edition)

Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

     

Abstract

Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naïve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
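As a rough illustration of the kind of sequence feature the abstract refers to, the sketch below computes the proportions of hydrophobic, neutral, and polar residues in a protein sequence. The three-way grouping used here is one commonly used hydrophobicity classification and is an assumption; the abstract does not state which grouping or which full feature set the authors used. The names `GROUPS` and `group_proportions` are hypothetical.

```python
from collections import Counter

# Assumed residue grouping (illustrative only; not taken from the abstract).
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def group_proportions(sequence):
    """Return the fraction of residues falling in each assumed group.

    Characters outside the 20 standard amino acids (e.g. 'X') are ignored.
    """
    counts = Counter()
    total = 0
    for residue in sequence.upper():
        for name, members in GROUPS.items():
            if residue in members:
                counts[name] += 1
                total += 1
                break
    if total == 0:
        # Empty or fully non-standard sequence: report zeros for every group.
        return {name: 0.0 for name in GROUPS}
    return {name: counts[name] / total for name in GROUPS}


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(group_proportions("LLKG"))
```

A vector of such proportions, one per sequence, could then be passed to any standard classifier; whether this matches the authors' actual feature vectors cannot be determined from the abstract alone.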

Bibliographic details

  • Source
  • Author affiliations

    Department of Pharmaceutical Sciences, School of Pharmacy-Worcester, Massachusetts College of Pharmacy and Health Sciences, Worcester, MA 01608-1715, USA;

    Center for Computational Pharmacology, Department of Pharmacology, University of Colorado School of Medicine, Aurora, CO 80010, USA;

    Gen*NY*Sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, State University of New York at Albany, Rensselaer, NY 12144-3456, USA.

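  • Original format: PDF
  • Text language: chi
  • Chinese Library Classification: Proteins
  • Keywords

    subcellular; human; protein; data analysis

    Machine translation: subcellular localization; machine learning; exploratory data analysis; decision tree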