Journal: BMC Genomics

ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels

Abstract

Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for the small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy and is faster overall. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the "Purine/Pyrimidine", "Just-A" and "Real" numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
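
The abstract names the numerical representations but does not define them, and it does not say which transform or distance measure is used. As a rough sketch only, assume that the "Purine/Pyrimidine" encoding maps A and G to -1 and C and T to +1, that "Just-A" maps A to 1 and every other base to 0, that the signal-processing step is the magnitude spectrum of a discrete Fourier transform, and that spectra are compared with a Pearson-correlation distance. All of these choices, and every name in the code, are assumptions made for illustration. Under those assumptions, the feature-extraction idea might look like this in Python:

```python
import cmath
import math

# Hypothetical encoding tables (assumed values, not taken from the abstract).
PURINE_PYRIMIDINE = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}
JUST_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}


def encode(seq, table):
    """Map each nucleotide to a number; unknown symbols become 0.0."""
    return [table.get(base, 0.0) for base in seq.upper()]


def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson_distance(u, v):
    """Return (1 - r) / 2 for Pearson correlation r, so the result is in [0, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return (1.0 - cov / (norm_u * norm_v)) / 2.0


# Usage: compare two toy sequences of equal length.
spec_1 = magnitude_spectrum(encode("ACGTTGCA", PURINE_PYRIMIDINE))
spec_2 = magnitude_spectrum(encode("AACCGTTA", PURINE_PYRIMIDINE))
toy_distance = pearson_distance(spec_1, spec_2)
```

The sketch assumes the sequences already have equal length and stops at the pairwise distance. A supervised classifier trained on such distances, as the abstract describes, would be a separate step and is not shown here.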