Published in: IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications

An efficient comparative machine learning-based metagenomics binning technique via using Random forest

Abstract

Metagenomics is the study of microorganisms collected directly from natural environments. Metagenomic studies use DNA fragments obtained directly from a natural environment by whole genome shotgun (WGS) sequencing. Sorting the random fragments produced by WGS sequencing into taxa-based groups is known as binning. Currently, there are two approaches to binning: sequence similarity methods and sequence composition methods. Sequence similarity methods are usually based on aligning sequences against known genomes, using tools such as BLAST or MEGAN. Because only a very small fraction of species is represented in current databases, similarity methods do not yield good results, and as a database of organisms grows, the cost of the search grows with it. Sequence composition methods are based on compositional features of a given DNA sequence, such as k-mers or other genomic signatures. Most current binning methods share two major weaknesses: they do not work well with short sequences or with closely related genomes. In this paper we propose new machine-learning-based predictive DNA sequence feature selection algorithms to solve binning problems more accurately and efficiently. We use oligonucleotide frequencies from 2-mers to 4-mers as features to differentiate between sequences: 2-mers produce 16 features, 3-mers produce 64, and 4-mers produce 256. We did not use features beyond 4-mers because the number of features grows exponentially with k; 5-mers would already require 1024 features. We found that 4-mers produce better results than 2-mers and 3-mers. The data sets used in this work have average sequence lengths of 250, 500, 1000, and 2000 base pairs. Experimental results of the proposed algorithms are presented to show the potential value of the proposed methods.
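The feature counts quoted above follow from the four-letter DNA alphabet: there are 4^k possible k-mers, giving 16, 64, 256, and 1024 for k = 2, 3, 4, and 5. The following is a minimal sketch of how such oligonucleotide frequency features might be computed. The function names are hypothetical, and the choices to use overlapping windows, to skip windows containing ambiguous bases, and to concatenate the 2-mer, 3-mer, and 4-mer vectors are assumptions for illustration, not details taken from the paper.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Return all 4**k possible k-mers in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer, counted over overlapping windows."""
    vocab = kmer_vocabulary(k)
    counts = dict.fromkeys(vocab, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: skip windows with ambiguous bases (e.g. N)
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


def feature_vector(sequence, ks=(2, 3, 4)):
    """Assumption: concatenate the frequency vectors for each k in `ks`."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        print(k, len(kmer_vocabulary(k)))  # vocabulary size is 4**k
    print(len(feature_vector("ACGTACGTTAGC")))  # 16 + 64 + 256 features
```

Each frequency vector is normalized by the number of counted windows, so reads of different lengths yield comparable features.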
The accuracy of the proposed algorithm was tested on a variety of data sets. Classification/prediction accuracy on the various simulated data sets ranged from 78% to 99% with the Random forest classifier and from 37% to 95% with the Naïve Bayes classifier. Random forest outperformed Naïve Bayes on every data set.
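A comparison of this kind could be set up roughly as follows with scikit-learn. This is an illustrative sketch only: the two "genomes" are hypothetical toy sources that differ in base composition, the reads are drawn as independent random bases rather than taken from real WGS data, and `GaussianNB` is an assumed choice of Naïve Bayes variant, since the abstract does not say which one was used. The accuracies it prints are not expected to match the figures reported above.

```python
import random
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 possible 4-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def four_mer_frequencies(read):
    """Relative 4-mer frequencies of a read over overlapping windows."""
    counts = [0] * len(KMERS)
    for i in range(len(read) - 3):
        counts[INDEX[read[i:i + 4]]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def simulate_reads(base_weights, n_reads, length, rng):
    """Toy stand-in for WGS reads: independent bases with source-specific weights."""
    return ["".join(rng.choices("ACGT", weights=base_weights, k=length))
            for _ in range(n_reads)]


def compare_classifiers(read_length=250, reads_per_genome=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical sources: label -> probabilities of A, C, G, T.
    genomes = {0: (0.30, 0.20, 0.20, 0.30), 1: (0.20, 0.30, 0.30, 0.20)}
    X, y = [], []
    for label, weights in genomes.items():
        for read in simulate_reads(weights, reads_per_genome, read_length, rng):
            X.append(four_mer_frequencies(read))
            y.append(label)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "naive_bayes": GaussianNB(),  # assumed variant
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    for name, acc in compare_classifiers().items():
        print(f"{name}: {acc:.3f}")
```

Changing `read_length` to 500, 1000, or 2000 gives a rough feel for how read length affects both classifiers on this toy data.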