Journal of Theoretical Biology

Prediction of protein subcellular localization with oversampling approach and Chou's general PseAAC

Highlights

  • We combine an oversampling method with SVM to handle protein subcellular localization on imbalanced data sets.
  • Jackknife test results for the SVM show that oversampling successfully reduces the imbalance of the data sets.
  • The excellent overall accuracy indicates that the feature representation and selection capture useful information from the protein sequence.

Abstract

Predicting protein subcellular location with support vector machines (SVMs) has recently become a popular research area because of the explosive growth of biological data. Although substantial progress has been made, few researchers have considered the problem of data imbalance before classification, which leads to low accuracy for some categories. In this work, we therefore combine an oversampling method with SVM to handle protein subcellular localization on imbalanced data sets. To capture valuable information about a protein, a PseAAC (pseudo amino acid composition) feature vector is extracted from the PSSM (position-specific scoring matrix), and features are then selected by principal component analysis (PCA). Next, the samples are oversampled to eliminate the imbalance in sample numbers among classes and fed into a support vector machine to predict the protein subcellular location. To evaluate the proposed method, jackknife tests are performed on three benchmark datasets (ZD98, CL317 and ZW225). Jackknife results for SVM experiments with and without oversampling show that oversampling successfully reduces the imbalance of the data sets, and the prediction accuracy of every class in every dataset is higher than 88.9%. Compared with other protein subcellular localization methods, the proposed method achieves the best performance. The overall accuracies on ZD98, CL317 and ZW225 are 93.2%, 96.00% and 92.15% respectively, which are 2.4%, 8.0% and 8.2% higher than the best methods in the comparison.
The excellent overall accuracy of the proposed method indicates that the feature representation and selection capture useful information from the protein sequence, and that oversampling resolves the imbalance of sample numbers in SVM classification.
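
The interplay between oversampling and the jackknife (leave-one-out) test can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption for illustration: the abstract does not say which oversampling algorithm was used or whether it was applied inside each jackknife fold, so the sketch uses plain random duplication on the training part only. A 3-nearest-neighbour vote stands in for the SVM to keep the code within the standard library, and the toy feature vectors and the labels "major" and "minor" are hypothetical, not PseAAC features.

```python
import random
from collections import Counter, defaultdict


def random_oversample(X, y, rng):
    """Duplicate randomly chosen samples until every class is as large as the largest."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def knn_predict(X_train, y_train, query, k=3):
    """Stand-in classifier (k-NN majority vote); the paper itself uses an SVM."""
    ranked = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], query)),
    )
    votes = Counter(y_train[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def jackknife_per_class_accuracy(X, y, oversample=True, seed=0):
    """Hold out each sample once; optionally oversample only the remaining samples."""
    rng = random.Random(seed)
    correct, total = Counter(), Counter()
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        if oversample:
            X_train, y_train = random_oversample(X_train, y_train, rng)
        total[y[i]] += 1
        if knn_predict(X_train, y_train, X[i]) == y[i]:
            correct[y[i]] += 1
    return {label: correct[label] / total[label] for label in total}


# Hypothetical toy data: six "major" samples near the origin, two "minor" ones far away.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [0.0, 0.0], [0.2, 0.2],
     [5.0, 5.1], [5.1, 5.0]]
y = ["major"] * 6 + ["minor"] * 2

print("without oversampling:", jackknife_per_class_accuracy(X, y, oversample=False))
print("with oversampling:   ", jackknife_per_class_accuracy(X, y, oversample=True))
```

On this toy data the held-out minority sample is outvoted when the training set is left imbalanced, while duplicating the remaining minority sample lets the vote go the other way. This mirrors, in a very simplified form, the per-class accuracy argument made in the abstract. Oversampling inside each fold is a design choice of this sketch that keeps copies of the held-out sample out of the training set.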
