PeerJ Computer Science

Comparison of machine learning and deep learning techniques in promoter prediction across diverse species


Abstract

Gene promoters are key DNA regulatory elements positioned around transcription start sites and are responsible for regulating gene transcription. Various alignment-based, signal-based and content-based approaches have been reported for promoter prediction. However, because not all promoter sequences show explicit features, the prediction performance of these techniques is poor. Many machine learning and deep learning models have therefore been proposed for promoter prediction. In this work, we studied methods for vector encoding and promoter classification using genome sequences of three distinct eukaryotes: yeast (Saccharomyces cerevisiae), a plant (Arabidopsis thaliana) and human (Homo sapiens). We compared one-hot vector encoding with frequency-based tokenization (FBT) for data pre-processing on a 1-D convolutional neural network (CNN) model. We found that FBT gives a shorter input dimension, which reduces training time without affecting the sensitivity and specificity of classification. We employed deep learning techniques, mainly a CNN and a recurrent neural network with long short-term memory (LSTM), together with a random forest (RF) classifier, for promoter classification at k-mer sizes of 2, 4 and 8. We found the CNN to be superior both in distinguishing promoters from non-promoter sequences (binary classification) and in species-specific classification of promoter sequences (multiclass classification). In summary, the contribution of this work lies in the use of a synthetic shuffled negative dataset and frequency-based tokenization for pre-processing. This study provides a comprehensive and generic framework for classification tasks in genomic applications and can be extended to various classification problems.
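
The abstract does not spell out how the two pre-processing ideas work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes overlapping k-mers (stride 1), an integer vocabulary ranked by corpus frequency, and negatives built by shuffling the bases of each promoter. All function names and the two toy sequences are hypothetical.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(seqs, k):
    """Map each k-mer to an integer rank, 1 = most frequent in the corpus."""
    counts = Counter(km for s in seqs for km in kmers(s, k))
    # Descending count, then alphabetical, so ties resolve deterministically.
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    return {km: rank for rank, km in enumerate(ranked, start=1)}


def encode(seq, k, vocab):
    """Integer-encode a sequence; 0 is reserved for k-mers not in vocab."""
    return [vocab.get(km, 0) for km in kmers(seq, k)]


def one_hot(seq):
    """Baseline encoding: one 4-element indicator row per base (A, C, G, T)."""
    return [[1 if base == ref else 0 for ref in "ACGT"] for base in seq]


def shuffled_negative(seq, rng):
    """Synthetic non-promoter: same base composition, random order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Toy data, invented for illustration only.
promoters = ["TATAAAGGCC", "GGTATAAACC"]
vocab = build_vocab(promoters, k=2)
tokens = encode(promoters[0], 2, vocab)
matrix = one_hot(promoters[0])
negative = shuffled_negative(promoters[0], random.Random(0))

print(len(tokens), "token ids vs", len(matrix) * 4, "one-hot values")
```

Under these assumptions a sequence of length L becomes L - k + 1 integers instead of an L x 4 one-hot matrix, which is one way the shorter input dimension mentioned above could arise. A shuffled negative keeps the base composition of its source promoter while destroying positional motifs.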
