Journal: Engineering Applications of Artificial Intelligence

Identification of DNA-protein binding sites by bootstrap multiple convolutional neural networks on sequence information


Abstract

Identification of DNA-protein binding sites in protein sequences plays an essential role in a wide variety of biological processes, and huge volumes of protein sequences have accumulated in the post-genomic era. In this study, we propose a new prediction approach suited to imbalanced DNA-protein binding site data. Specifically, motivated by the imbalance between DNA-protein binding and non-binding sites, we employ the Adaptive Synthetic Sampling (ADASYN) approach to over-sample the positive data and a bootstrap strategy to under-sample the negative data, balancing the number of binding and non-binding samples. Furthermore, we employ three types of features to encode the sequence-based information of each protein residue: the position-specific scoring matrix, one-hot encoding, and predicted solvent accessibility. In addition, we design an ensemble convolutional neural network classifier to handle the imbalance between binding and non-binding sites in protein sequences. Extensive experiments were conducted on three real DNA-protein binding site datasets, PDNA-543, PDNA-224 and PDNA-316, to validate the effectiveness of our method at predicting binding sites under ten-fold cross-validation. The experimental results demonstrate that our method achieves high prediction performance and outperforms state-of-the-art sequence-based DNA-protein binding site predictors in terms of Sensitivity, Specificity, Accuracy, Precision and Matthews Correlation Coefficient (MCC). Our method obtains MCC values of 0.63, 0.48 and 0.67 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively. Compared with state-of-the-art prediction models, the MCC values of our method are higher by at least 0.24, 0.13 and 0.23 on the PDNA-543, PDNA-224 and PDNA-316 datasets, respectively.
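The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: bootstrap under-sampling of the negative (non-binding) class into balanced subsets, one per ensemble member, and the MCC metric used for evaluation. The function names `bootstrap_balanced_subsets` and `mcc` are hypothetical, labels are assumed to be 1 for binding and 0 for non-binding residues, and the ADASYN over-sampling step and the convolutional networks themselves are deliberately left out.

```python
import math
import random


def bootstrap_balanced_subsets(labels, n_subsets, seed=0):
    """Build index lists for an ensemble (illustrative, not the paper's code).

    Each subset keeps every positive index and draws the same number of
    negative indices with replacement, so each subset is class-balanced.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_subsets):
        # Bootstrap: sample negatives with replacement, one per positive.
        sampled_neg = [rng.choice(neg) for _ in pos]
        subsets.append(pos + sampled_neg)
    return subsets


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels: 2 binding residues among 10 (made up for illustration).
    labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    for subset in bootstrap_balanced_subsets(labels, n_subsets=3):
        print(sorted(subset))
    print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))
```

In an ensemble of this kind, one classifier would typically be trained per subset and the per-residue predictions combined, for example by averaging or voting; how the paper combines its networks is not stated in the abstract.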
