Microbial Genomics

NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data


Abstract

Non-classically secreted proteins (NCSPs) are proteins that are located in the extracellular environment although they lack known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms. However, the performance of existing methods for NCSP identification has been unsatisfactory; in particular, they suffer from data deficiency and possible overfitting. Further improvement is desirable, especially to address the lack of informative features and to mine subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each training subdataset. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred. In ten-fold cross-validation, the model achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve of 0.9975. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
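
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the balanced subdatasets are taken to come from random undersampling of the majority class, the "F-score" is taken to be the commonly used Fisher-style feature score (between-class spread divided by within-class variance), and the sub-models are taken to be integrated by averaging predicted probabilities. All function names are hypothetical, and the sequential forward search and classifier training steps are omitted.

```python
import random
from statistics import mean, variance


def balanced_subsets(minority, majority, n_subsets=10, seed=0):
    """Pair every minority sample with an equally sized random draw
    from the majority class (assumed undersampling scheme)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, len(minority))  # without replacement
        subsets.append((list(minority), drawn))
    return subsets


def f_score(pos_values, neg_values):
    """Fisher-style F-score of one feature (assumed definition):
    between-class spread over the sum of within-class sample variances."""
    overall = mean(pos_values + neg_values)
    between = (mean(pos_values) - overall) ** 2 + (mean(neg_values) - overall) ** 2
    within = variance(pos_values) + variance(neg_values)  # (n - 1) denominators
    return between / within if within > 0 else 0.0


def rank_features(pos_rows, neg_rows):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(pos_rows[0])
    scores = [
        f_score([row[j] for row in pos_rows], [row[j] for row in neg_rows])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def ensemble_probability(sub_model_probs):
    """Integrate sub-model outputs by averaging their predicted
    probabilities (assumed integration rule)."""
    return mean(sub_model_probs)
```

In a fuller version, each balanced subset would receive its own feature ranking, and a sequential forward search would then add features in ranked order and keep the combination with the best cross-validated performance, so that each sub-model ends up with its own subset-specific feature set.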
