Journal: Biology Direct

A high-performance approach for predicting donor splice sites based on short window size and imbalanced large samples



Abstract

Splice site prediction has been a long-standing problem in bioinformatics. Although many computational approaches to splice site prediction have achieved satisfactory accuracy, further improvement remains important because it contributes to more accurate prediction of gene structure. A proper window size must be determined before prediction. An overly long window may introduce irrelevant features that reduce predictive accuracy, whereas a short window carrying maximum information may perform better in both predictive accuracy and time cost. Furthermore, the number of false splice sites that follow the GT–AG rule far exceeds the number of true splice sites, so accurate and rapid prediction of splice sites from imbalanced large samples has always been a challenge. Therefore, based on a short window size and imbalanced large samples, we developed a new computational method named chi-square decision table (χ²-DT) for donor splice site prediction. Using a short window of 11 bp, χ²-DT extracts improved positional and compositional features based on the chi-square test, then introduces features one by one based on information gain, and constructs a balanced decision table for imbalanced pattern classification. With a 2000:271,132 (true sites:false sites) training set, χ²-DT achieves the highest independent test accuracy (93.34%) when compared with three classifiers (random forest, artificial neural network, and relaxed variable kernel density estimator) and takes a short computation time (89 s). χ²-DT also exhibits good independent test accuracy (92.40%) when validated on BG-570 mutated sequences with frameshift errors (nucleotide insertions and deletions). Moreover, χ²-DT is compared with both long-window and short-window methods and is found to outperform all of them in predictive accuracy.
Based on short window size and imbalanced large samples, the proposed method not only achieves higher predictive accuracy than some existing methods, but also has high computational speed and good robustness against nucleotide insertions and deletions. This article was reviewed by Ryan McGinty, Ph.D. and Dirk Walther.
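
The abstract names three ingredients: short windows around candidate GT sites, chi-square-based features, and feature ranking by information gain. The following is a minimal Python sketch of those ingredients only, not the authors' implementation. The window layout (3 bases upstream of the GT and 8 bases from the G onward, 11 bp in total) and all function names are assumptions made for illustration. The statistics are the textbook Pearson chi-square and information gain, not the "improved" features, and the balanced decision table itself is not reproduced.

```python
from collections import Counter
from math import log2

BASES = "ACGT"

def candidate_donor_windows(seq, upstream=3, downstream=8):
    """Yield (gt_index, window) for each GT that has a full window around it.

    The 3 + 8 = 11 bp layout is an assumption; the abstract gives only the size.
    """
    seq = seq.upper()
    for i in range(upstream, len(seq) - downstream + 1):
        if seq[i:i + 2] == "GT":
            yield i, seq[i - upstream:i + downstream]

def chi_square_at(true_windows, false_windows, pos):
    """Pearson chi-square statistic of the 2 x 4 table (class x base) at one position."""
    rows = [Counter(w[pos] for w in true_windows),
            Counter(w[pos] for w in false_windows)]
    row_totals = [sum(r[b] for b in BASES) for r in rows]
    col_totals = [sum(r[b] for r in rows) for b in BASES]
    n = sum(row_totals)
    stat = 0.0
    for r, rt in zip(rows, row_totals):
        for b, ct in zip(BASES, col_totals):
            expected = rt * ct / n
            if expected > 0:  # skip bases never observed at this position
                stat += (r[b] - expected) ** 2 / expected
    return stat

def entropy(counts):
    """Shannon entropy (bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(true_windows, false_windows, pos):
    """Reduction in class entropy obtained by splitting on the base at `pos`."""
    n = len(true_windows) + len(false_windows)
    gain = entropy([len(true_windows), len(false_windows)])
    for b in BASES:
        t = sum(w[pos] == b for w in true_windows)
        f = sum(w[pos] == b for w in false_windows)
        if t + f:
            gain -= (t + f) / n * entropy([t, f])
    return gain
```

In this sketch, a position whose base distribution differs sharply between true and false windows gets a large chi-square value and a high information gain, which is the kind of signal a one-by-one feature introduction step could rank on. Handling the 2000:271,132 class imbalance would need an additional balancing step that the abstract does not specify.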
