Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature

Abstract

Motivation: In this work, we aim to develop a computational approach for predicting DNA-binding sites in proteins from amino acid sequences. To avoid overfitting with this method, all available DNA-binding proteins from the Protein Data Bank (PDB) are used to construct the models. The random forest (RF) algorithm is used because it is fast and performs robustly across different parameter values. A novel hybrid feature is presented that combines three kinds of information: evolutionary information of the amino acid sequence, secondary structure (SS) information, and orthogonal binary vector (OBV) information, which reflects the characteristics of the 20 amino acids with respect to two physicochemical properties (dipoles and volumes of the side chains). The numbers of binding and non-binding residues in proteins are highly unbalanced, so a novel scheme is proposed to deal with the problem of imbalanced datasets by downsizing the majority class.

Results: The results show that the RF model achieves 91.41% overall accuracy with a Matthews correlation coefficient of 0.70 and an area under the receiver operating characteristic curve (AUC) of 0.913. To our knowledge, the RF method using the hybrid feature is currently the computationally optimal approach for predicting DNA-binding sites in proteins from amino acid sequences without using three-dimensional (3D) structural information. We have demonstrated that the prediction results are useful for understanding protein-DNA interactions.

Availability: DBindR web server implementation is freely available at http://www.cbi.seu.edu.cn/DBindR/DBindR.htm.

Contact: xsuneu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
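
The abstract mentions downsizing the majority class and reports a Matthews correlation coefficient, but gives no details of the scheme itself. The following is a minimal Python sketch of those two ideas only, assuming plain random undersampling to a 1:1 class ratio; it is not the authors' novel scheme. The function names `downsample_majority` and `matthews_corrcoef` and the label convention (1 = binding, 0 = non-binding) are hypothetical choices made for illustration.

```python
import math
import random


def downsample_majority(samples, labels, seed=0):
    """Randomly drop majority-class items until both classes have equal size.

    Assumption: simple random undersampling, not the scheme from the paper.
    Labels are 1 (binding residue) or 0 (non-binding residue).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority item plus an equally sized random majority subset.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC is a useful companion to overall accuracy on data like this, because accuracy alone can look high when one class dominates, whereas MCC uses all four cells of the confusion matrix.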