Journal: International Journal of Basic and Applied Biology: IJBAB

Eukaryotic Donor Splice Site Prediction: A Machine Learning Approach

Abstract

Accurate gene identification is one of the most important and challenging tasks in bioinformatics, and its success depends on the precise identification of splice sites. Because the GT and AG di-nucleotides mark possible donor and acceptor splice sites, respectively, every GT and AG in a DNA sequence is a candidate donor or acceptor splice site and must be classified as either a real splice site or a pseudo splice site. Since GT and AG di-nucleotides occur very frequently at non-splice-site positions, it is very hard to distinguish a true donor or acceptor splice site from a false one. Various computational methods have been developed for splice site prediction, and among them machine learning methods have been the more successful. In machine-learning-based splice site prediction, feature vectors are generated through different encoding schemes. In this investigation, an attempt is made to develop a new sequence encoding approach based on di-nucleotide association. The encoded sequence data are then used to predict donor splice sites with Artificial Neural Network (ANN), Support Vector Machine (SVM) and Random Forest (RF) methods, evaluated by 10-fold cross-validation. SVM and RF coupled with the proposed encoding approach achieved better accuracy than the other combinations, as measured by the area under the Receiver Operating Characteristic (ROC) curve (AUC). The performance of the proposed approach was also compared with several existing approaches using ...
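
The candidate-site framing described in the abstract can be illustrated with a minimal Python sketch. It scans a DNA string for every GT, extracts a fixed window around each occurrence, and turns the window into a feature vector. The function names, the window sizes (3 bases upstream, 6 downstream) and the one-hot encoding are assumptions made for illustration only; in particular, one-hot encoding is a common stand-in and is not the di-nucleotide association encoding proposed in the paper, whose details are not given here.

```python
from typing import List, Tuple

# Hypothetical helper: names and default window sizes are illustrative assumptions.
def candidate_donor_sites(seq: str, upstream: int = 3,
                          downstream: int = 6) -> List[Tuple[int, str]]:
    """Return (position, window) for every GT with enough flanking sequence.

    position is the 0-based index of the G in the GT di-nucleotide.
    """
    seq = seq.upper()
    sites = []
    start = seq.find("GT")
    while start != -1:
        lo, hi = start - upstream, start + 2 + downstream
        if lo >= 0 and hi <= len(seq):  # skip GTs too close to either end
            sites.append((start, seq[lo:hi]))
        start = seq.find("GT", start + 1)
    return sites

# One-hot code per base; unknown symbols (e.g. N) map to all zeros.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def one_hot_encode(window: str) -> List[int]:
    """Stand-in encoding (NOT the paper's scheme): 4 binary features per base."""
    vec: List[int] = []
    for base in window:
        vec.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vec

if __name__ == "__main__":
    example = "CAGGTAAGTCCGTACGATCAGGTGAGT"  # made-up sequence
    for pos, window in candidate_donor_sites(example):
        print(pos, window, len(one_hot_encode(window)))
```

Each resulting vector would then be labeled as a true or pseudo donor site and passed to a classifier such as an SVM or a random forest. Every candidate yields a vector of the same length, which is what those classifiers require.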
