A Novel Heuristic for Data Distribution in Massively Parallel Phylogenetic Inference Using Site Repeats

Abstract

Continuous advances in molecular sequencing technologies now make it possible to infer, on supercomputers, evolutionary trees (phylogenies) that comprise hundreds to thousands of species at the whole-transcriptome or whole-genome level. The phylogenetic likelihood function (PLF) consumes 90-95% of total execution time in such analyses and is therefore typically parallelized. Recently, the site repeats (SR) technique was introduced to substantially accelerate the PLF. It identifies repeating patterns in parts of the likelihood computation and omits the corresponding redundant calculations to save time and space. However, the SR technique induces a parallel load imbalance. In this paper, we introduce a novel randomized data distribution algorithm, RDDA, to improve load balance for SR-based likelihood calculations. The algorithm is available as open-source code, induces minimal run-time overhead, and yields run-time improvements of up to 25% on empirical datasets and up to 50% for a synthetic, worst-case scenario. This improvement is substantial, as current evolutionary data analyses may require tens of millions of core hours on supercomputer systems.
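
The load imbalance mentioned in the abstract can be illustrated with a small sketch. It assumes a toy four-taxon alignment and a fixed set of inner nodes, and it models the SR cost of a process as the number of distinct sub-columns per inner node among the sites that process holds. The names `sr_cost` and `imbalance` are hypothetical, and the seeded random shuffle is only a stand-in for a randomized distribution; it is not the RDDA described in the paper.

```python
import random

def sr_cost(alignment, subtrees, sites):
    """Count the per-node computations one process performs under this toy SR model.

    For every inner node (given as a tuple of leaf names), sites whose
    characters agree on all leaves below that node share one computation,
    so the cost at that node is the number of distinct sub-columns.
    """
    cost = 0
    for leaves in subtrees:
        patterns = {tuple(alignment[leaf][s] for leaf in leaves) for s in sites}
        cost += len(patterns)
    return cost

def imbalance(alignment, subtrees, assignment):
    """Ratio of the most loaded process's cost to the mean cost (1.0 = balanced)."""
    costs = [sr_cost(alignment, subtrees, sites) for sites in assignment]
    return max(costs) / (sum(costs) / len(costs))

# Toy data (assumption): the first 8 sites are identical columns,
# the last 8 sites have no repeats at any inner node.
alignment = {
    "A": "AAAAAAAA" + "ACGTACGT",
    "B": "AAAAAAAA" + "CAGTTGCA",
    "C": "CCCCCCCC" + "GTACCATG",
    "D": "CCCCCCCC" + "TGCAGTAC",
}
subtrees = [("A", "B"), ("C", "D"), ("A", "B", "C", "D")]

# Contiguous block split over two processes: one gets all the repeats.
block = [list(range(0, 8)), list(range(8, 16))]

# Stand-in randomized split: shuffle the site indices, then cut in half.
sites = list(range(16))
random.Random(0).shuffle(sites)
shuffled = [sites[:8], sites[8:]]

print("block    imbalance:", imbalance(alignment, subtrees, block))
print("shuffled imbalance:", imbalance(alignment, subtrees, shuffled))
```

In this toy model the block split gives one process a cost of 3 and the other a cost of 24, even though both hold the same number of sites. This is the kind of imbalance that a site-count-based distribution cannot see.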