Conference: AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition

An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters



Abstract

Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.
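The central implementation idea in the abstract is overlapping both host-device data transfers and MPI communications with computation on the GPU. The following is a minimal MPI-CUDA sketch of that pattern for a Jacobi-style update on a 1-D slab decomposition along z; the kernel names, grid sizes, buffer layout, and neighbor arrangement are illustrative assumptions, not the authors' actual implementation.

/* Sketch: interior update runs on one CUDA stream while boundary planes are
 * copied to pinned host buffers and exchanged with non-blocking MPI; the
 * boundary update is launched once the halos arrive. All names and sizes
 * below are assumptions for illustration. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 256
#define NY 256
#define NZ_LOCAL 64          /* z-planes owned by this rank          */
#define SLICE (NX * NY)      /* points in one z-plane (halo layer)   */

__global__ void jacobi_interior(double *unew, const double *uold)
{ /* stencil update of z-planes 2..NZ_LOCAL-1 would go here (omitted) */ }

__global__ void jacobi_boundary(double *unew, const double *uold)
{ /* stencil update of z-planes 1 and NZ_LOCAL (needs halos; omitted) */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int below = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int above = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* NZ_LOCAL owned planes plus one halo plane on each side. */
    size_t bytes = (size_t)SLICE * (NZ_LOCAL + 2) * sizeof(double);
    double *d_uold, *d_unew;
    cudaMalloc((void **)&d_uold, bytes);
    cudaMalloc((void **)&d_unew, bytes);

    /* Pinned host buffers allow asynchronous, overlappable PCIe copies. */
    double *h_send_lo, *h_send_hi, *h_recv_lo, *h_recv_hi;
    cudaMallocHost((void **)&h_send_lo, SLICE * sizeof(double));
    cudaMallocHost((void **)&h_send_hi, SLICE * sizeof(double));
    cudaMallocHost((void **)&h_recv_lo, SLICE * sizeof(double));
    cudaMallocHost((void **)&h_recv_hi, SLICE * sizeof(double));

    cudaStream_t s_interior, s_halo;
    cudaStreamCreate(&s_interior);
    cudaStreamCreate(&s_halo);

    for (int step = 0; step < 100; ++step) {
        /* 1. Interior update needs no halo data; launch it first. */
        jacobi_interior<<<dim3(NX/16, NY/16, NZ_LOCAL - 2), dim3(16, 16, 1),
                          0, s_interior>>>(d_unew, d_uold);

        /* 2. Concurrently copy the outermost owned planes to the host. */
        cudaMemcpyAsync(h_send_lo, d_uold + SLICE,          SLICE * sizeof(double),
                        cudaMemcpyDeviceToHost, s_halo);
        cudaMemcpyAsync(h_send_hi, d_uold + SLICE*NZ_LOCAL, SLICE * sizeof(double),
                        cudaMemcpyDeviceToHost, s_halo);
        cudaStreamSynchronize(s_halo);

        /* 3. Exchange halos with neighbors while the GPU computes the interior. */
        MPI_Request req[4];
        MPI_Irecv(h_recv_lo, SLICE, MPI_DOUBLE, below, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(h_recv_hi, SLICE, MPI_DOUBLE, above, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(h_send_lo, SLICE, MPI_DOUBLE, below, 1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(h_send_hi, SLICE, MPI_DOUBLE, above, 0, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* 4. Push received halos back to the device and finish the boundaries. */
        cudaMemcpyAsync(d_uold,                          h_recv_lo, SLICE * sizeof(double),
                        cudaMemcpyHostToDevice, s_halo);
        cudaMemcpyAsync(d_uold + SLICE*(NZ_LOCAL + 1),   h_recv_hi, SLICE * sizeof(double),
                        cudaMemcpyHostToDevice, s_halo);
        jacobi_boundary<<<dim3(NX/16, NY/16, 2), dim3(16, 16, 1),
                          0, s_halo>>>(d_unew, d_uold);

        cudaDeviceSynchronize();           /* both streams done before the swap */
        double *tmp = d_uold; d_uold = d_unew; d_unew = tmp;
    }

    cudaFree(d_uold); cudaFree(d_unew);
    cudaFreeHost(h_send_lo); cudaFreeHost(h_send_hi);
    cudaFreeHost(h_recv_lo); cudaFreeHost(h_recv_hi);
    MPI_Finalize();
    return 0;
}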
