
GPU-Accelerated Large-Scale Genome Assembly


Abstract

Spurred by a widening gap between hardware accelerators and traditional processors, numerous bioinformatics applications have harnessed the computing power of GPUs and reported substantial performance improvements compared to their CPU-based counterparts. However, most of these GPU-based applications only focus on the read alignment problem, while the field of de novo assembly still relies mostly on CPU-based solutions. This is primarily due to the nature of the assembly workload which is not only compute-intensive but also extremely data-intensive. Such workloads require large memories, making it difficult to adapt them to use GPUs with their limited memory capacities. To the best of our knowledge, no GPU-based assembler reported in the recent literature has attempted to assemble datasets larger than a few tens of gigabytes, whereas real sequence datasets are often several hundreds of gigabytes in size. In this paper, we present a new GPU-accelerated genome assembler called LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps. LaSAGNA can also run on multiple GPUs across multiple compute nodes connected by a high-speed network to expedite the assembly process. To utilize the limited memory on GPUs efficiently, LaSAGNA uses a semi-streaming approach that makes at most a logarithmic number of passes over the input data based on the available memory. Moreover, we propose a two-level streaming model, from disk to host memory and from host memory to device memory, to minimize disk I/O. Using LaSAGNA, we can assemble a 400 GB human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s.
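The two-level streaming idea (disk to host memory, then host memory to device memory) can be illustrated with a minimal sketch. This is not LaSAGNA's implementation: the function names, the byte budgets, and the `process_on_device` callback are hypothetical, and the "device" is simulated by an ordinary Python callable rather than a real GPU transfer. The point is only that each large block is read from disk once and then consumed in smaller device-sized pieces, so disk reads are far fewer than device transfers.

```python
import io
from typing import BinaryIO, Callable, Iterator, Tuple


def host_blocks(stream: BinaryIO, host_bytes: int) -> Iterator[bytes]:
    """Level 1: read the input in blocks sized to fit in host memory."""
    while True:
        block = stream.read(host_bytes)
        if not block:
            return
        yield block


def device_chunks(block: bytes, device_bytes: int) -> Iterator[bytes]:
    """Level 2: slice a host-resident block into device-sized chunks."""
    view = memoryview(block)
    for start in range(0, len(view), device_bytes):
        yield bytes(view[start:start + device_bytes])


def two_level_stream(
    stream: BinaryIO,
    host_bytes: int,
    device_bytes: int,
    process_on_device: Callable[[bytes], None],
) -> Tuple[int, int]:
    """Stream data through both levels; return (disk_reads, device_transfers).

    `process_on_device` is a hypothetical stand-in for copying a chunk to
    the GPU and launching a kernel on it.
    """
    if device_bytes <= 0 or host_bytes < device_bytes:
        raise ValueError("need 0 < device_bytes <= host_bytes")
    disk_reads = 0
    device_transfers = 0
    for block in host_blocks(stream, host_bytes):
        disk_reads += 1  # one large sequential read per host block
        for chunk in device_chunks(block, device_bytes):
            device_transfers += 1  # many small host-to-device copies
            process_on_device(chunk)
    return disk_reads, device_transfers
```

With a host budget several times larger than the device budget, the number of disk reads shrinks by roughly that factor relative to reading one device-sized chunk at a time, which is the motivation the abstract gives for the second streaming level.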
