Chinese Journal of Computers (《计算机学报》)

Efficient Heterogeneous Acceleration of the Smoothed Particle Hydrodynamics Method

Abstract

Existing GPU-accelerated implementations of the Smoothed Particle Hydrodynamics (SPH) method almost all target the simplified Euler governing equations; GPU implementations of the complete Navier-Stokes equations are rare, and the literature describes their difficulties, optimization strategies, and acceleration results only vaguely. Moreover, the way CPUs and GPUs cooperate strongly affects the overall efficiency of a heterogeneous platform, so the choice of GPU acceleration model deserves further study. The goal of this work is to efficiently accelerate petaPar, our in-house SPH application based on the Navier-Stokes equations, on heterogeneous platforms.

We first analyze the computational characteristics of the Euler and Navier-Stokes equations from their mathematical formulation and summarize the difficulties the Navier-Stokes equations face on GPUs. The Euler equations involve only simple scalar and vector arithmetic and yield a typical light-weight, compute-intensive kernel well suited to GPUs. The complete Navier-Stokes equations, by contrast, involve complicated material constitutive models and heavy tensor computation, which lead to "big kernel" problems on the GPU such as memory access pressure, insufficient cache, low occupancy, and register spilling. We optimize the Navier-Stokes particle-interaction kernel by reducing particle properties, extracting operations from the interaction kernel into the particle-update kernel, exploiting particle reuse, and maximizing GPU occupancy; the implementation is described in Section 5.1.

We also investigate three GPU acceleration models: hot-spot acceleration (run only the hotspots on the GPU), GPU-entire acceleration (run the whole computation on the GPU), and peer-to-peer cooperation (treat the CPU and GPU as equivalent processors). The three models are compared in terms of development cost, application scope, and theoretical speedup, and the communication optimization strategies of the peer-to-peer model are discussed in depth. Because communication particles are not contiguously distributed, extracting, inserting, and deleting them on the GPU are in essence parallel operations over discontinuous memory, which severely degrade CPU-GPU synchronization; this problem has not been addressed in the literature. We solve it by improving the particle indexing rule: particles are sorted not only by cell index but also by cell type, as described in Section 5.2.3.

All three acceleration models are implemented in petaPar for both the simplified Euler equations and the complete Navier-Stokes equations. Relative to a single CPU core, the three models achieve application-level speedups of 8x, 33x, and 36x for the Euler equations and 6x, 15x, and 20x for the Navier-Stokes equations, respectively. The GPU-entire model exceeds the theoretical speedup ceiling of the hot-spot model (12.5x for a hotspot ratio of 92%), and the peer-to-peer model improves further on GPU-entire. In particular, with our kernel optimization strategies and the peer-to-peer model, the Navier-Stokes code achieves a 20x overall speedup on a heterogeneous computing node. For the peer-to-peer Navier-Stokes version, which has the widest applicability and the best acceleration, strong and weak scalability tests are carried out on 6 and 1024 heterogeneous nodes, respectively, of the Titan supercomputer at ORNL, achieving parallel efficiencies of 67.1% and 75.2%.
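The 12.5x ceiling quoted for the hot-spot model follows directly from Amdahl's law with a hotspot ratio of 92%: even an infinitely fast GPU leaves the remaining 8% running serially on the CPU. A minimal sketch (the function name is ours, not from the paper):

```python
def amdahl_speedup(accelerated_fraction, accel_factor):
    """Overall speedup when a fraction of the runtime is sped up by
    accel_factor and the rest runs unchanged (Amdahl's law)."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / accel_factor)

# Hot-spot model ceiling for a 92% hotspot ratio and an
# infinitely fast accelerator: 1 / (1 - 0.92) ~= 12.5
limit = amdahl_speedup(0.92, float("inf"))
```

This is why the 33x and 36x Euler speedups of the GPU-entire and peer-to-peer models necessarily require moving more than just the hotspot onto the GPU.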
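The improved particle indexing rule can be illustrated with a small sketch (the cell-type codes and helper names here are hypothetical, not from the paper): sorting by (cell type, cell index) rather than by cell index alone groups all communication particles into one contiguous block, so extracting or deleting them on the GPU becomes a single contiguous copy instead of a scatter over discontinuous memory.

```python
# Hypothetical sketch: the sort key is (cell_type, cell_index) instead
# of cell_index alone, so all particles in communication (halo) cells
# end up contiguous at the tail of the particle array.
INNER, HALO = 0, 1  # hypothetical cell-type codes

def order_particles(particles, cell_type_of):
    """particles: list of (cell_index, payload); cell_type_of maps a
    cell index to INNER or HALO. Returns the particles reordered so
    the halo block is one contiguous slice at the end."""
    return sorted(particles, key=lambda p: (cell_type_of[p[0]], p[0]))

def extract_halo(ordered, cell_type_of):
    """With the halo block contiguous, GPU-side extraction of
    communication particles reduces to slicing off the tail."""
    first = next((i for i, p in enumerate(ordered)
                  if cell_type_of[p[0]] == HALO), len(ordered))
    return ordered[first:]
```

In the actual GPU code the same idea would apply to a key-based particle sort (e.g. a radix sort over packed keys); the point is that one extra high-order key bit turns three discontinuous-memory operations into contiguous block copies.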
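The "extract operations into the particle-update kernel" strategy can be sketched as follows (the equation of state and all names are illustrative, not from the paper): any per-particle quantity that the pairwise interaction kernel would otherwise recompute once per neighbor, such as pressure from an equation of state, is instead computed once per particle in the O(N) update pass and cached, shrinking the register and instruction footprint of the O(N·k) interaction kernel.

```python
def update_pass(density, c0=10.0, rho0=1.0):
    """O(N) pass: precompute per-particle pressure once (illustrative
    linear equation of state p = c0^2 * (rho - rho0)), instead of
    recomputing it inside every neighbor interaction."""
    return [c0 * c0 * (rho - rho0) for rho in density]

def interaction_pass(pressure, neighbors):
    """O(N*k) pass: the hot kernel now only reads the cached values
    (the pairwise term here is a placeholder for the real SPH sum)."""
    return [sum(pressure[i] + pressure[j] for j in neighbors[i])
            for i in range(len(pressure))]
```

On a GPU the same restructuring trades a cheap extra global-memory load in the big interaction kernel for fewer live registers, which is one lever for raising occupancy.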
