Identifying the minimum amplicon sequence depth to adequately predict classes in eDNA-based marine biomonitoring using supervised machine learning

Verena Dully; Thomas A. Wilding; Timo Mühlhaus; Thorsten Stoeck


Abstract

Environmental DNA metabarcoding is a powerful approach for use in biomonitoring and impact assessments. Amplicon-based eDNA sequence data are characteristically highly divergent in sequencing depth (total reads per sample) as influenced inter alia by the number of samples simultaneously analyzed per sequencing run. The random forest (RF) machine learning algorithm has been successfully employed to accurately classify unknown samples into monitoring categories. To employ RF to eDNA data, and avoid sequencing-depth artifacts, sequence data across samples are normalized using rarefaction, a process that inherently loses information. The aim of this study was to inform future sampling designs in terms of the relationship between sampling depth and RF accuracy. We analyzed three published and one new bacterial amplicon datasets, using a RF, based initially on the maximal rarefied data available (minimum mean of > 30,000 reads across all datasets) to give our baseline performance. We then evaluated the RF classification success based on increasingly rarefied datasets. We found that extreme to moderate rarefaction (50–5000 sequences per sample) was sufficient to achieve prediction performance commensurate to the full data, depending on the classification task. We did not find that the number of classification classes, data balance across classes, or the total number of sequences or samples, were associated with predictive accuracy. We identified the ability of the training data to adequately characterize the classes being mapped as the most important criterion and discuss how this finding can inform future sampling design for eDNA-based biomonitoring to reduce costs and computation time.
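The rarefaction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `rarefy`, the example read counts, and the chosen depths are assumptions made for illustration only. It shows random subsampling of one sample's reads, without replacement, to a fixed depth, which is the usual meaning of rarefaction. In a workflow like the one described, the rarefied tables at each depth would then be passed to a random forest classifier.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=None):
    """Subsample `depth` reads without replacement from one sample's read counts.

    Returns None when the sample holds fewer than `depth` reads.
    (Hypothetical helper, not taken from the paper.)
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)
    # One list entry per read, labelled by the index of its taxon.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    drawn = Counter(rng.sample(reads, depth))
    return [drawn.get(taxon, 0) for taxon in range(len(counts))]

# Illustrative read counts for five taxa in one sample (made-up values).
sample = [12000, 9000, 6000, 2500, 500]

# Increasingly rarefied versions of the same sample.
rarefied = {depth: rarefy(sample, depth, seed=1) for depth in (5000, 500, 50)}
for depth, table in rarefied.items():
    print(depth, table)
```

Each rarefied table sums exactly to its target depth, and rare taxa may drop to zero at low depths, which is the information loss the abstract refers to.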


Bibliographic details

Source
Computational and Structural Biotechnology Journal | 2021, issue a | 13 pages
Authors
Verena Dully; Thomas A. Wilding; Timo Mühlhaus; Thorsten Stoeck
Original format
PDF
Chinese Library Classification
Theory of biological structure

