Journal: BMC Bioinformatics

Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees

Abstract

Background: The rapid accumulation of molecular sequence data, driven by novel wet-lab sequencing technologies, poses new challenges for large-scale maximum likelihood-based phylogenetic analyses on trees with more than 30,000 taxa and several genes. The three main computational challenges are: numerical stability, the scalability of search algorithms, and the high memory requirements for computing the likelihood.

Results: We introduce methods for solving these three key problems and provide respective proof-of-concept implementations in RAxML. The mechanisms presented here are not RAxML-specific and can thus be applied to any likelihood-based (Bayesian or maximum likelihood) tree inference program. We develop a new search strategy that can reduce the time required for tree inferences by more than 50% while yielding equally good trees (in the statistical sense) for well-chosen starting trees. We present an adaptation of the Subtree Equality Vector technique for phylogenomic datasets with missing data (already available in RAxML v728) that can reduce execution times and memory requirements by up to 50%. Finally, we discuss issues pertaining to the numerical stability of the Γ model of rate heterogeneity on very large trees and argue in favor of rate heterogeneity models that use a single rate or rate category for each site to resolve these problems.

Conclusions: We address three major issues pertaining to large-scale tree reconstruction under maximum likelihood and propose respective solutions. Respective proof-of-concept/production-level implementations of our ideas are made available as open-source code.
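To illustrate why numerical stability becomes a concern on very large trees, here is a minimal Python sketch. It is not taken from the paper or from RAxML. It assumes that a per-site likelihood can be modelled as a long product of small positive factors, and it uses a generic rescaling scheme with a counter of scaling events. The names `SCALE_THRESHOLD`, `SCALE_FACTOR`, `naive_product`, and `scaled_log_product` are hypothetical.

```python
import math

# Hypothetical constants: rescale whenever the running product drops below 2^-256.
# Multiplying by a power of two is exact in binary floating point.
SCALE_THRESHOLD = 2.0 ** -256
SCALE_FACTOR = 2.0 ** 256


def naive_product(values):
    """Multiply all factors directly; long chains underflow to 0.0."""
    result = 1.0
    for v in values:
        result *= v
    return result


def scaled_log_product(values):
    """Return log(prod(values)) using rescaling and a scaling-event counter.

    Assumes every factor lies in (0, 1], as probabilities do.
    """
    result = 1.0
    scalings = 0
    for v in values:
        if not (0.0 < v <= 1.0):
            raise ValueError("factors must lie in (0, 1]")
        result *= v
        if result == 0.0:
            # A single factor was so small that the product underflowed anyway.
            raise ArithmeticError("underflow despite rescaling")
        while result < SCALE_THRESHOLD:
            result *= SCALE_FACTOR
            scalings += 1
    # Undo the rescaling in log space.
    return math.log(result) - scalings * math.log(SCALE_FACTOR)


if __name__ == "__main__":
    factors = [0.01] * 5000  # stand-in for contributions along a very deep tree
    print(naive_product(factors))       # underflows in double precision
    print(scaled_log_product(factors))  # stays finite in log space
```

In this sketch the direct product collapses to zero after a few hundred factors, while the rescaled version keeps a finite log value. How often rescaling is needed depends on how small the individual factors are. The abstract's preference for a single rate or rate category per site is a separate argument that this example does not reproduce.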
