Conference: Euro-Par 2016: Parallel Processing Workshops

Improving Bioinformatics Analysis of Large Sequence Datasets Parallelizing Tools for Population Genomics

Abstract

Next-generation sequencing (NGS) technologies initiated a revolution in genomics, producing massive amounts of biological data and a consequent need to adapt current computing infrastructures. Multiple genome alignment, variant analysis, and phylogenetic tree construction, which have quadratic complexity in the best case, can take days or weeks to complete on conventional computers. Most of these analyses, which involve several tools integrated into workflows, can be divided into independent tasks that run in parallel. Adequate load balancing, data partitioning, granularity, and I/O tuning are key factors for achieving suitable speedups. In this paper we present a coarse-grain parallelization of GH caller (Genotype/Haplotype caller), a tool used in population genomics workflows that performs a probabilistic identification process to account for the frequency of variants present between the individuals of a population. It implements a master-worker model using the standard Message Passing Interface (MPI): the master process handles the orchestration, concurrently and iteratively distributing the data among the available worker processes by mapping subsets of data to them. Our results show a 260x performance gain over the initial non-parallelized version when using 64 processes together with additional optimizations.
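The master-worker distribution described in the abstract can be illustrated with a minimal sketch. This is not the GH caller implementation: it swaps MPI for Python's standard-library `threading` and `queue` modules so that it runs without an MPI installation, and the names `partition`, `process_chunk`, `worker`, and `master`, as well as the toy "count non-A bases" workload, are hypothetical stand-ins chosen for illustration. The point is only the pattern: the master partitions the data into chunks, idle workers pull the next chunk as soon as they finish the previous one, and the master combines the partial results.

```python
import queue
import threading


def partition(n_items, chunk_size):
    """Split range(n_items) into (start, stop) chunks; chunk_size sets the granularity."""
    return [(s, min(s + chunk_size, n_items)) for s in range(0, n_items, chunk_size)]


def process_chunk(data, start, stop):
    """Hypothetical per-chunk work: count bases that differ from 'A'."""
    return sum(1 for base in data[start:stop] if base != "A")


def worker(data, tasks, results):
    """Pull chunks until the master's sentinel (None) arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work for this worker
            break
        start, stop = task
        results.put((start, process_chunk(data, start, stop)))


def master(data, n_workers=4, chunk_size=8):
    """Orchestrate the run: queue the chunks, start the workers, merge the results."""
    tasks, results = queue.Queue(), queue.Queue()
    chunks = partition(len(data), chunk_size)
    for chunk in chunks:
        tasks.put(chunk)
    for _ in range(n_workers):
        tasks.put(None)  # one sentinel per worker

    threads = [
        threading.Thread(target=worker, args=(data, tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Workers finish in arbitrary order; sort by chunk start before combining.
    partial = sorted(results.get() for _ in chunks)
    return sum(count for _, count in partial)
```

Because workers take chunks from a shared queue instead of receiving a fixed share up front, a slow chunk delays only one worker, which is one simple way to address the load-balancing and granularity concerns the abstract mentions. In an MPI version, the queue operations would presumably be replaced by point-to-point messages between the master rank and the worker ranks.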