Journal: BMC Bioinformatics

Comparing K-mer based methods for improved classification of 16S sequences


Abstract

The need for precise and stable taxonomic classification is highly relevant in modern microbiology. In parallel with the explosion in the amount of accessible sequence data, the focus of classification methods has shifted. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers with sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other in both data usage and modelling strategy. We based our study on the widely known and widely used naïve Bayes classifier from the RDP project; four other methods were implemented, and all were tested on two different data sets, using full-length sequences as well as fragments of typical read length. The differences in classification error between the methods were small, but they were stable and consistent across both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau. We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification further. Classification errors occur most frequently for genera represented by few sequences. For improving the taxonomy and for testing new classification methods, a better, more universal and more robust training data set is crucial.
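As a rough illustration of what "counting K-mers with sliding windows" means, and of how such counts can feed a multinomial naïve Bayes classifier, the following Python sketch may help. It is not the implementation evaluated in the paper: the function names (`count_kmers`, `train_multinomial`, `classify`), the pseudocount of 1, the choice to skip windows containing ambiguous bases, and the toy sequences and genus labels are all assumptions made for this example.

```python
from collections import Counter
from math import log


def count_kmers(seq, k):
    """Count overlapping K-mers by sliding a window of width k along seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption of this sketch: windows with ambiguous bases are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def train_multinomial(labelled_seqs, k):
    """Pool K-mer counts per genus from (genus, sequence) pairs."""
    pooled = {}
    for genus, seq in labelled_seqs:
        pooled.setdefault(genus, Counter()).update(count_kmers(seq, k))
    return pooled


def classify(seq, pooled, k, pseudo=1.0):
    """Return the genus with the highest multinomial log-likelihood."""
    query = count_kmers(seq, k)
    n_possible = 4 ** k  # number of distinct K-mers over A, C, G, T
    best_genus, best_score = None, float("-inf")
    for genus, counts in pooled.items():
        # Pseudocounts keep unseen K-mers from giving zero probability.
        total = sum(counts.values()) + pseudo * n_possible
        score = sum(n * log((counts[kmer] + pseudo) / total)
                    for kmer, n in query.items())
        if score > best_score:
            best_genus, best_score = genus, score
    return best_genus


# Toy training data (hypothetical sequences and genus labels).
training = [("GenusA", "ACGTACGTACGT"), ("GenusB", "TTTTGGGGCCCC")]
model = train_multinomial(training, k=3)
predicted = classify("ACGTAC", model, k=3)
```

With real 16S data the training set would hold many sequences per genus and a larger K would normally be used; the value K = 3 here only keeps the toy example readable.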
