Conference: International Conference on Application-specific Systems, Architectures and Processors

Unleashing the performance potential of CPU-GPU platforms for the 3D atmospheric Euler solver


Abstract

As a traditional application on various supercomputers, atmospheric modeling has long suffered from low performance efficiency. In this paper, we pick the 3D Euler equation solver (the most essential dynamic component of a non-hydrostatic atmospheric model) as the target application, and explore the maximum performance efficiency that can be achieved on CPU-GPU hybrid architectures. Besides presenting a suitable hybrid domain decomposition methodology and making proper use of tuning techniques for both the CPU and GPU parts, we further propose a novel GPU tuning technique, namely a customizable data caching mechanism with a thread warp rescheduling scheme, which is specifically designed for the Euler solver. Combining all the optimization approaches, a remarkable performance boost has been achieved on mainstream GPU architectures, including the Tesla Fermi C2050, K20X, K40 and K80. In particular, on the latest Tesla K80, we demonstrate a 31.64× speedup over a 12-core E5-2697 CPU. In addition, on a hybrid CPU-GPU node with two 12-core E5-2697 CPUs and two Tesla K80 GPUs, a sustained double-precision performance of 1.04 Tflops (16% of the peak) is achieved, which is remarkably higher than the efficiency of similar optimization tasks on heterogeneous platforms (strictly less than 10%, as demonstrated in the related work). Finally, a nearly linear weak scaling efficiency is achieved, which demonstrates the effectiveness of our domain decomposition method.
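The efficiency figure quoted above is a ratio of sustained to peak floating-point rate. As a minimal sketch of that arithmetic, the Python below recovers the node peak implied by the two numbers in the abstract. The function names `efficiency` and `implied_peak` are hypothetical and chosen for illustration, and because the 16% figure is rounded, the resulting peak is only approximate.

```python
def efficiency(sustained_tflops: float, peak_tflops: float) -> float:
    """Fraction of the peak rate that is actually sustained."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return sustained_tflops / peak_tflops


def implied_peak(sustained_tflops: float, eff: float) -> float:
    """Peak rate implied by a sustained rate and an efficiency in (0, 1]."""
    if not 0 < eff <= 1:
        raise ValueError("eff must be in (0, 1]")
    return sustained_tflops / eff


# Figures quoted in the abstract: 1.04 Tflops sustained at 16% of peak.
# The 16% value is rounded, so the implied peak is an approximation.
peak = implied_peak(1.04, 0.16)
print(f"implied node peak: about {peak:.2f} Tflops")

# For comparison, the same sustained rate at the 10% upper bound the
# abstract cites for related work would need a larger peak.
peak_at_10_percent = implied_peak(1.04, 0.10)
print(f"peak needed at 10% efficiency: about {peak_at_10_percent:.2f} Tflops")
```

Put differently, at a fixed sustained rate, raising efficiency from 10% to 16% means the same result is delivered by hardware with a proportionally lower theoretical peak.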
