Conference: AIAA Aviation Forum; AIAA Fluid Dynamics Conference

Multilevel Parallelism for CFD Codes on Heterogeneous Platforms

Abstract

High-level parallel programming approaches have recently become popular in complex fluid dynamics research because they are cross-platform and easy to implement. OpenACC is a high-level, directive-based parallel programming standard for offloading program execution onto a graphics processing unit (GPU). Its support for portable programming lets program development for CPU and GPU platforms be effectively unified. The directive-based model contrasts with the explicit, low-level implementation model of CUDA and OpenCL, which suffers from poor portability and maintainability as accelerator hardware changes. In this work, we outline guidelines for efficiently annotating the baseline serial CPU version of a CFD code to achieve acceptable multithreaded performance on the GPU. An Artificial Compressibility Method (ACM) is used to study steady-state incompressible 2D and 3D heat and fluid flows. We study loop scheduling with careful attention to the limited on-chip memory available, and we investigate asynchronous calculations in the CFD algorithm as a way to increase computing performance. The PGI Fortran 15.7 compiler is used throughout this study. Memory management and loop scheduling considerations yield an approximately 20% speedup in GPU performance over the baseline OpenACC implementation. A performance analysis of the thermal flow solver demonstrates that a single NVIDIA Tesla C2075 GPU delivers speed comparable to 8 cores of a dual-socket Xeon E5-2687W workstation. A multi-GPU performance analysis is performed using NVIDIA Tesla M2050 GPUs on the Virginia Tech HokieSpeed supercomputer. A weak scalability analysis demonstrates up to 92% efficiency with 32 GPUs.
With 16 GPUs, the thermal flow solver runs up to 190 times faster than a single core of a Xeon E5645 processor, 10 times faster than a single GPU, and 12 times faster than 16 CPUs distributed across separate sockets running an MPI multi-CPU implementation of the code.
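The abstract describes three techniques: annotating the loops of a serial CFD code with OpenACC directives, tuning loop scheduling, and running work asynchronously. The following is a minimal C sketch of those ideas. The paper itself uses Fortran with the PGI compiler. The sketch applies the standard artificial-compressibility continuity equation, dp/dτ + β∇·u = 0, to one explicit pseudo-time step on a 2D grid with central differences.

Everything concrete in the sketch is an illustrative assumption, not the authors' code:
- the flat row-major grid layout and the `IDX` macro;
- the function names `acm_pressure_step` and `acm_demo_max_error`;
- the `gang`/`vector` schedule and the `async(1)` queue.

Compiled without OpenACC support, the pragmas are ignored and the loops run serially. With an OpenACC-capable compiler, for example NVIDIA HPC SDK `nvc` with `-acc`, the loop nest is offloaded to the GPU.

```c
#include <assert.h>
#include <stdlib.h>

/* Row-major index into a flat nx-by-ny array (illustrative layout). */
#define IDX(i, j, ny) ((i) * (ny) + (j))

/* One explicit pseudo-time step of the ACM continuity equation
 *   dp/dtau + beta * div(u) = 0
 * on interior points, using second-order central differences.
 * Compilers without OpenACC support ignore the pragmas, and the
 * loops then run serially on the CPU. */
void acm_pressure_step(const double *restrict u, const double *restrict v,
                       double *restrict p, int nx, int ny,
                       double dx, double dy, double beta, double dtau)
{
    /* Outer loop mapped to gangs, inner loop to vector lanes; the kernel
     * is queued asynchronously on queue 1, so the caller must wait on it. */
    #pragma acc parallel loop gang async(1) present(u[0:nx*ny], v[0:nx*ny], p[0:nx*ny])
    for (int i = 1; i < nx - 1; ++i) {
        #pragma acc loop vector
        for (int j = 1; j < ny - 1; ++j) {
            double div = (u[IDX(i + 1, j, ny)] - u[IDX(i - 1, j, ny)]) / (2.0 * dx)
                       + (v[IDX(i, j + 1, ny)] - v[IDX(i, j - 1, ny)]) / (2.0 * dy);
            p[IDX(i, j, ny)] -= dtau * beta * div;
        }
    }
}

/* Demo: u = x, v = 0 gives div(u) = 1, so one step should lower every
 * interior pressure by dtau * beta. Returns the largest deviation from
 * that value, or -1.0 if allocation fails. */
double acm_demo_max_error(int nx, int ny, double beta, double dtau)
{
    assert(nx >= 3 && ny >= 3);
    size_t n = (size_t)nx * (size_t)ny;
    double dx = 1.0 / (nx - 1), dy = 1.0 / (ny - 1);
    double *u = malloc(n * sizeof *u);
    double *v = malloc(n * sizeof *v);
    double *p = malloc(n * sizeof *p);
    if (!u || !v || !p) {
        free(u); free(v); free(p);
        return -1.0;
    }
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j) {
            u[IDX(i, j, ny)] = i * dx;
            v[IDX(i, j, ny)] = 0.0;
            p[IDX(i, j, ny)] = 0.0;
        }

    /* Keep u and v resident on the device; copy p back when the region ends. */
    #pragma acc data copyin(u[0:n], v[0:n]) copy(p[0:n])
    {
        acm_pressure_step(u, v, p, nx, ny, dx, dy, beta, dtau);
        #pragma acc wait(1)
    }

    double err = 0.0;
    for (int i = 1; i < nx - 1; ++i)
        for (int j = 1; j < ny - 1; ++j) {
            double d = p[IDX(i, j, ny)] + dtau * beta;
            if (d < 0.0) d = -d;
            if (d > err) err = d;
        }
    free(u); free(v); free(p);
    return err;
}
```

Each pressure update reads only the velocity field, so updating `p` in place is race-free. A momentum update that reads neighbouring values of the same array it writes would need a separate output array.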
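As a reading aid for the scalability figure: a weak-scaling test holds the grid size per GPU fixed while the number of GPUs N grows. Efficiency is conventionally the ratio of the single-GPU time per iteration, T_1, to the N-GPU time, T_N. Assuming the paper uses this standard definition (the abstract does not state its exact timing metric), 92% efficiency on 32 GPUs means a 32-times-larger problem takes only about 9% longer per iteration than the single-GPU case:

```latex
E_{\mathrm{weak}}(N) = \frac{T_1}{T_N},
\qquad
E_{\mathrm{weak}}(32) = 0.92
\;\Longrightarrow\;
T_{32} = \frac{T_1}{0.92} \approx 1.09\,T_1
```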
