
An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters



Abstract

Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPUs) are now being augmented with multiple GPUs in each compute node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. In this study, we pursue mixed MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.
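The overlap strategy named in the abstract can be illustrated with a minimal MPI-CUDA sketch. This is not the authors' code: the 1-D domain decomposition, the grid sizes NX and NY, and the placeholder kernels jacobi_boundary and jacobi_interior (left empty here) are illustrative assumptions. The pattern is the one the abstract describes: boundary rows are updated and staged to pinned host memory in one CUDA stream, halos are exchanged with non-blocking MPI while a second stream updates the interior, and received halos are copied back to the device before the buffers are swapped.

// Minimal sketch of overlapping halo exchange with interior computation
// (illustrative only; kernel bodies are placeholders).
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 512   // points per row (assumed)
#define NY 512   // rows per rank, including two ghost rows (assumed)

__global__ void jacobi_boundary(double *u_new, const double *u) { /* update edge rows only */ }
__global__ void jacobi_interior(double *u_new, const double *u) { /* update interior rows */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int up = (rank + 1) % size, down = (rank - 1 + size) % size;  // periodic neighbors

    double *d_u, *d_unew, *h_send, *h_recv;
    cudaMalloc(&d_u, NX * NY * sizeof(double));
    cudaMalloc(&d_unew, NX * NY * sizeof(double));
    cudaMemset(d_u, 0, NX * NY * sizeof(double));
    cudaMemset(d_unew, 0, NX * NY * sizeof(double));
    cudaMallocHost(&h_send, 2 * NX * sizeof(double));  // pinned buffers: bottom + top rows
    cudaMallocHost(&h_recv, 2 * NX * sizeof(double));

    cudaStream_t halo_stream, bulk_stream;
    cudaStreamCreate(&halo_stream);
    cudaStreamCreate(&bulk_stream);
    MPI_Request reqs[4];

    for (int step = 0; step < 100; ++step) {
        // 1. Update the boundary rows first, in the halo stream.
        jacobi_boundary<<<NX / 256 + 1, 256, 0, halo_stream>>>(d_unew, d_u);
        // 2. Stage the updated boundary rows to pinned host memory (asynchronous).
        cudaMemcpyAsync(h_send,      d_unew + NX,            NX * sizeof(double),
                        cudaMemcpyDeviceToHost, halo_stream);
        cudaMemcpyAsync(h_send + NX, d_unew + (NY - 2) * NX, NX * sizeof(double),
                        cudaMemcpyDeviceToHost, halo_stream);

        // 3. Launch the interior update concurrently in a second stream.
        jacobi_interior<<<(NX * NY) / 256 + 1, 256, 0, bulk_stream>>>(d_unew, d_u);

        // 4. Exchange halos with neighbors while the interior kernel runs.
        cudaStreamSynchronize(halo_stream);  // wait only for the halo copies
        MPI_Irecv(h_recv,      NX, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(h_recv + NX, NX, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(h_send,      NX, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(h_send + NX, NX, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // 5. Push received halos back into the device ghost rows.
        cudaMemcpyAsync(d_unew,                 h_recv,      NX * sizeof(double),
                        cudaMemcpyHostToDevice, halo_stream);
        cudaMemcpyAsync(d_unew + (NY - 1) * NX, h_recv + NX, NX * sizeof(double),
                        cudaMemcpyHostToDevice, halo_stream);
        cudaDeviceSynchronize();

        double *tmp = d_u; d_u = d_unew; d_unew = tmp;  // swap time levels
    }

    cudaFree(d_u); cudaFree(d_unew);
    cudaFreeHost(h_send); cudaFreeHost(h_recv);
    MPI_Finalize();
    return 0;
}

Whether the staged copies and the interior kernel actually overlap depends on pinned host memory and the device's copy engines; the synchronization points above wait only on what each step strictly needs.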

