Conference: IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology

Phenotype Prediction of DNA Sequence Data: A Machine- and Statistical Learning Approach


Abstract

Advances in high-throughput sequencing technologies continue to generate large amounts of sequencing data, enabling the holistic investigation of complex biological phenomena. Genomic sequence data are used for a wide range of applications such as gene annotation, expression studies, personalized treatment and precision medicine. However, this rapid expansion in available sequence data poses a tremendous computational challenge, calling for the development of novel data processing and analytic methods, as well as computing resources to match the volume of these datasets. In this work, a machine- and statistical learning approach for classification based on k-mer representations of DNA sequence data is proposed. While targeted sequencing focuses on a specific region of interest, whole genome sequencing enables a view of a species’ entire genome. Thus, the approach is tested using whole genome sequences of Mycobacterium tuberculosis isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum k-mer size and use it to build classification models, and (iii) predict the phenotype from whole genome sequence data of a given bacterial isolate. Furthermore, the computing challenges associated with analysing whole genome sequence data to produce interpretable and explainable insights are described. Classification models were trained using 104 Mycobacterium tuberculosis isolates. Cluster analyses showed that k-mers can be used to discriminate phenotypes and that the discrimination becomes more concise as the k-mer size increases. The best performing classification model used a k-mer size of 10 (the longest k-mer considered in this study) and achieved an accuracy, recall, precision, specificity, and Matthews correlation coefficient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4, respectively.
This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also highlights the importance of increasing the k-mer size to produce more biologically explainable results, and underscores the interplay between accuracy, computing resources such as processing and memory, and the explainability of classification results. Furthermore, the analysis provides a new way to extract genetic information from genomic data and to identify phenotype relationships that are integral to explaining complex biological mechanisms.
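
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: turning a DNA sequence into a k-mer count representation, and computing the reported evaluation metrics from a binary confusion matrix. The function names (`kmer_counts`, `kmer_vector`, `binary_metrics`) and the choice to skip windows containing non-ACGT symbols are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers in a sequence.

    Assumption: windows containing symbols other than A, C, G, T
    (for example N) are skipped rather than imputed.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence: str, k: int) -> list[int]:
    """Fixed-length feature vector over all 4**k k-mers in lexicographic order."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, recall, precision, specificity and MCC from confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        # MCC is conventionally reported as 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Toy sequence and toy confusion counts, purely illustrative.
    print(kmer_counts("ACGTAC", 2))
    print(binary_metrics(tp=5, fp=1, tn=3, fn=1))
```

The dense vector grows as 4^k, so k = 10 already implies 1,048,576 possible features per isolate. This is one way to see the trade-off between k-mer size and memory that the abstract mentions; a sparse representation such as the `Counter` itself may be preferable for larger k.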
