Genomic Sequence Classification Using Probabilistic Topic Modeling

Abstract

Taxonomic classification of genomic sequences is usually based on evolutionary distance obtained by alignment. In this work we introduce a novel alignment-free classification approach based on probabilistic topic modeling. Using a k-mer (small fragments of length k) decomposition of DNA sequences and the Latent Dirichlet Allocation algorithm, we built a classifier for 16S rRNA bacterial gene sequences. We tested our method with a tenfold cross-validation procedure on a bacterial dataset of 3000 elements belonging to the most numerous bacterial phyla: Actinobacteria, Firmicutes and Proteobacteria. Experiments were carried out using both complete and 400 bp long 16S sequences, in order to test the robustness of the proposed methodology. Our results, in terms of precision scores and for different numbers of topics, range from 100% at class level to 77% at genus level, for both full-length and 400 bp sequences, considering k-mers of length 8. These results demonstrate the effectiveness of the proposed approach.
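
As an illustration of the k-mer decomposition step only, the following minimal sketch counts the k-mers of a single sequence, which is the kind of "bag of words" representation a topic model such as Latent Dirichlet Allocation could take as input. The function name `kmer_counts`, the toy sequence, the use of overlapping windows with step 1, and the skipping of windows that contain ambiguous bases are all assumptions made for this example; the abstract does not specify these details.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")


def kmer_counts(sequence: str, k: int = 8) -> Counter:
    """Count overlapping k-mers in a DNA sequence (sliding window, step 1).

    Assumption: windows containing characters other than A, C, G, T
    (for example the ambiguity code N) are skipped.
    """
    seq = sequence.upper()
    return Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= DNA_ALPHABET
    )


# Hypothetical toy sequence; a real 16S rRNA gene is roughly 1500 bp long.
toy = "ACGTACGTAC"
counts = kmer_counts(toy, k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Each sequence then plays the role of a document and each k-mer the role of a word; with k = 8 there are at most 4^8 = 65,536 distinct words in the vocabulary.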