An efficient comparative machine learning-based metagenomics binning technique via using Random forest

Abstract

Metagenomics is the study of microorganisms collected directly from natural environments. Metagenomics studies use DNA fragments obtained directly from a natural environment by whole genome shotgun (WGS) sequencing. Sorting the random fragments obtained from whole genome shotgun sequencing into taxa-based groups is known as binning. Currently, there are two different approaches to binning: sequence similarity methods and sequence composition methods. Sequence similarity methods are usually based on aligning sequences to known genomes, using tools such as BLAST or MEGAN. Because only a very small fraction of species is available in current databases, similarity methods do not yield good results, and as a given database of organisms grows, the complexity of the search grows as well. Sequence composition methods are based on compositional features of a given DNA sequence, such as k-mers or other genomic signatures. Most current binning methods have two major issues: they do not work well with short sequences or with closely related genomes. In this paper we propose new machine-learning-based predictive DNA sequence feature selection algorithms to solve binning problems more accurately and efficiently. In this work we use oligonucleotide frequencies from 2-mers to 4-mers as features to differentiate between sequences. 2-mers produce 16 features, 3-mers produce 64 features, and 4-mers produce 256 features. We did not use features higher than 4-mers because the number of features increases exponentially; 5-mers would already give 1024 features. We found that 4-mers produce better results than 2-mers and 3-mers. The data sets used in this work have average sequence lengths of 250, 500, 1000, and 2000 base pairs. Experimental results are presented to show the potential value of the proposed methods. The accuracy of the proposed algorithms is tested on a variety of data sets; the classification/prediction accuracy achieved is between 78% and 99% on various simulated data sets using a Random forest classifier and between 37% and 95% using a Naïve Bayes classifier. The Random forest classifier outperformed Naïve Bayes on all data sets.
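
The oligonucleotide-frequency features described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kmer_vocabulary`, `kmer_frequencies`, `feature_vector`), the use of overlapping windows, the normalization to relative frequencies, the skipping of windows with ambiguous bases, and the option to concatenate several values of k are all assumptions made for illustration. The example read is made up.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Return all 4**k possible k-mers over A, C, G, T in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(sequence, k):
    """Relative frequency of every k-mer in `sequence`, using overlapping windows.

    Windows containing characters other than A, C, G, T (for example N) are
    ignored. This is an assumption, not something stated in the abstract.
    """
    seq = sequence.upper()
    vocab = kmer_vocabulary(k)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    if total == 0:
        # Sequence too short or made only of ambiguous bases.
        return [0.0] * len(vocab)
    return [counts[kmer] / total for kmer in vocab]


def feature_vector(sequence, ks=(2, 3, 4)):
    """Concatenate k-mer frequency vectors for each k in `ks` (hypothetical layout)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(sequence, k))
    return features


if __name__ == "__main__":
    read = "ATGCGATACGCTTGAGGCTAACGT"  # made-up example read
    for k in (2, 3, 4, 5):
        # The vocabulary size is 4**k: 16, 64, 256, 1024.
        print(k, len(kmer_vocabulary(k)))
    # Number of features when 2-, 3- and 4-mers are concatenated.
    print(len(feature_vector(read)))
```

The vocabulary size is 4^k, which gives the 16, 64, 256, and 1024 features quoted in the abstract for k = 2, 3, 4, and 5. Concatenating the 2-mer, 3-mer, and 4-mer vectors, as this sketch optionally does, would give 16 + 64 + 256 = 336 features; the abstract does not say whether the authors combined them or evaluated each k separately.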


Bibliographic record

Source
IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications | 2013 | pp. 191-196 | 6 pages
Authors
Saghir Helal; Megherbi Dalila B.
Original format: PDF
Keywords
Binning; Bioinformatics; Computational intelligence; Machine learning; Metagenomics; Next generation Sequencing; Pattern Classification; Random forest; Reduction methods; bagged decision tree; forward sequential feature selection; t-test


