Conference: IEEE Signal Processing in Medicine and Biology Symposium

Machine Learning Applications to DNA Subsequence and Restriction Site Analysis

Abstract

Based on the BioBricks™ standard, restriction synthesis is a novel catabolic, iterative DNA synthesis method that uses endonucleases to synthesize a query sequence from a reference sequence. In this work, the reference sequence is built from shorter subsequences, each of which is classified as applicable or inapplicable for the synthesis method using one of three machine learning methods: Support Vector Machine (SVM), random forest, and Convolutional Neural Network (CNN). Before these methods are applied to the data, a series of feature selection, curation, and reduction steps is used to create an accurate and representative feature space. Following these preprocessing steps, three pipelines are proposed to classify subsequences based on their nucleotide sequence and other relevant features corresponding to the restriction sites of over 200 endonucleases. The sensitivities of SVM, random forest, and CNN are 94.9%, 92.7%, and 91.4%, respectively. Each method scores lower in specificity: 77.4%, 85.7%, and 82.4% for SVM, random forest, and CNN, respectively. In addition to analyzing these results, the misclassifications in SVM and CNN are investigated. Across these two models, features with a derived nucleotide specificity visually contribute more to classification than other features do. This observation is an important factor when considering new nucleotide sensitivity features for future studies.
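The abstract reports sensitivity and specificity for each classifier. As a reminder of what those two figures measure, the following minimal sketch computes them for a binary applicable/inapplicable labelling. The function name `sensitivity_specificity`, the 1/0 label encoding, and the toy labels are assumptions made for illustration only; they are not taken from the paper and do not reproduce its data or pipelines.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding (hypothetical): 1 = subsequence applicable for
    restriction synthesis, 0 = inapplicable.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")

    sensitivity = tp / (tp + fn)  # fraction of applicable items found
    specificity = tn / (tn + fp)  # fraction of inapplicable items rejected
    return sensitivity, specificity


# Toy labels, invented for this example (not the paper's data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under this encoding, a sensitivity that exceeds specificity, as reported for all three models, means the classifier misses few applicable subsequences but accepts a larger share of inapplicable ones.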