Journal: Computer Physics Communications

GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5



Abstract

This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes. (C) 2018 Elsevier B.V. All rights reserved.
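
The overlap that the abstract attributes to the NOWAIT clause on TARGET constructs can be illustrated with a minimal C sketch. This is not the authors' code, which is not shown on this page. The function name `overlap_demo`, the array names, and the trivial loop bodies are hypothetical placeholders for the real scalar-field kernels and CPU-side work.

```c
/* Hypothetical illustration of OpenMP 4.5 "target nowait" overlap.
 * gpu_field and cpu_field are placeholder arrays, not names from the paper. */
void overlap_demo(double *gpu_field, double *cpu_field, int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* The target region becomes a deferred task, so the host thread
         * does not block here while the device (or host fallback) works. */
        #pragma omp target nowait map(tofrom: gpu_field[0:n])
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; i++)
            gpu_field[i] = 2.0 * gpu_field[i];

        /* Independent host work that can proceed while the target task runs. */
        for (int i = 0; i < n; i++)
            cpu_field[i] = cpu_field[i] + 1.0;

        /* Wait for the target task before the host reads gpu_field again. */
        #pragma omp taskwait
    }
}
```

If no offload device is available, an OpenMP compiler runs the target region on the host, and without OpenMP enabled the pragmas are ignored and the loops run serially. The results are the same in each case, only the overlap is lost.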
