Published in: Scientific Programming

Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters


Abstract

Graphics processing units (GPUs) offer strong floating-point performance and high memory bandwidth for data-parallel workloads and have been widely adopted in high-performance computing (HPC). The compute unified device architecture (CUDA) serves as a parallel computing platform and programming model that reduces the complexity of GPU programming. Programmable GPUs are becoming increasingly popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+-up upwind scheme and the three-step Runge–Kutta method are used for spatial and temporal discretization, respectively. Turbulence is modeled with the k-ω SST two-equation model. The CPU manages only GPU execution and communication, while the GPU is responsible for data processing. Parallel execution and memory access optimizations are applied to the GPU-based CFD codes. We propose a nonblocking communication method that fully overlaps GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization achieves a speedup of more than 36× over CPU-based parallel computing, and that the parallel algorithm scales well.
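The one-dimensional domain decomposition mentioned above can be illustrated with a minimal sketch. This is not the authors' code: `decompose_1d` is a hypothetical helper that splits a grid of `n_cells` along one axis into per-GPU index ranges whose sizes differ by at most one cell, which is the usual way to balance workload in a 1D decomposition.

```python
def decompose_1d(n_cells, n_gpus):
    """Return (start, end) half-open index ranges, one per GPU rank.

    Cells are distributed as evenly as possible: the first
    `n_cells % n_gpus` ranks each receive one extra cell.
    """
    base, extra = divmod(n_cells, n_gpus)
    ranges = []
    start = 0
    for rank in range(n_gpus):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# 10 cells over 4 GPUs: block sizes 3, 3, 2, 2
print(decompose_1d(10, 4))  # → [(0, 3), (3, 6), (6, 8), (8, 10)]
```

In the actual solver each rank would additionally exchange halo layers at its block boundaries with neighboring ranks via MPI, which is what the paper's nonblocking two-stream scheme overlaps with interior computation.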
