Bacterial whole genome-based phylogeny: construction of a new benchmarking dataset and assessment of some existing methods

Abstract

Background

Whole genome sequencing (WGS) is increasingly used in diagnostics and surveillance of infectious diseases. A major application of WGS is identifying outbreak clusters, so there is a need for methods that can accurately and efficiently infer phylogenies from sequencing reads. In the present study we describe a new dataset that we created for benchmarking such WGS-based methods on epidemiological data, and we present an analysis in which we use the data to compare the performance of some current methods.

Results

Our aim was to create a benchmark dataset that mimics sequencing data of the sort that might be collected during an outbreak of an infectious disease. This was achieved by letting an E. coli hypermutator strain grow in the lab for 8 consecutive days, each day splitting the culture in two while also collecting samples for sequencing. The result is a dataset consisting of 101 whole genome sequences with known phylogenetic relationships. Among the sequenced samples, 51 correspond to internal nodes in the phylogeny because they are ancestral, while the remaining 50 correspond to leaves.

We also used the newly created dataset to compare three different methods, available online, that infer phylogenies from whole-genome sequencing reads: NDtree, CSI Phylogeny and REALPHY. One complication when comparing the output of these methods with the known phylogeny is that phylogenetic methods typically build trees where all observed sequences are placed as leaves, even though some of them are in fact ancestral. We therefore devised a method for post-processing the inferred trees by collapsing short branches (thus relocating some leaves to internal nodes), and we also present two new measures of tree similarity that take into account the identity of both internal and leaf nodes.

Conclusions

Based on this analysis we find that, among the investigated methods, CSI Phylogeny had the best performance, correctly identifying 73% of all branches in the tree and 71% of all clades.

We have made all data from this experiment (raw sequencing reads, consensus whole-genome sequences, as well as descriptions of the known phylogeny in a variety of formats) publicly available, with the hope that other groups may find this data useful for benchmarking and exploring the performance of epidemiological methods. All data is freely available at: https://cge.cbs.dtu.dk/services/evolution_data.php.
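The post-processing idea described in the Results (collapsing short branches so that a sampled ancestor moves from a leaf position to an internal node) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the `collapse_short_leaves` function, and the `threshold` value are hypothetical, and the abstract does not state which branch-length cutoff was used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not taken from the paper."""
    label: Optional[str] = None   # sample name, or None for an unsampled node
    length: float = 0.0           # length of the branch leading to the parent
    children: List["Node"] = field(default_factory=list)


def collapse_short_leaves(node: Node, threshold: float) -> Node:
    """Relabel an unlabeled internal node with a child leaf whose branch is
    no longer than `threshold`, and remove that leaf from the tree."""
    # Process the subtrees first (post-order traversal).
    for child in node.children:
        collapse_short_leaves(child, threshold)
    if node.label is None:
        for child in node.children:
            is_leaf = not child.children
            if is_leaf and child.length <= threshold:
                node.label = child.label     # the sample becomes the ancestor
                node.children.remove(child)  # safe: we stop iterating at once
                break
    return node


# Toy inferred tree: sample "A" sits on a zero-length branch next to the
# clade (B, C), which suggests that "A" is ancestral to B and C.
inner = Node(length=3.0, children=[Node("B", 2.0), Node("C", 4.0)])
root = Node(children=[Node("A", 0.0), inner])

# The cutoff of 0.5 is an arbitrary assumption for this toy example.
collapse_short_leaves(root, threshold=0.5)
```

After the call, the root carries the label "A" and has the (B, C) clade as its only child, so the sample is treated as an internal node rather than a leaf. A real pipeline would read and write trees in a standard format such as Newick and would need a rule for nodes with several short leaf branches; both are left out of this sketch.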
