Memory Access Optimization of High-Order CFD Stencil Computations on GPU

Abstract

Stencil computations are a class of computations common in scientific and engineering applications. Because their arithmetic intensity is relatively low, their performance is largely determined by memory access. This paper studies memory access optimization for the key stencil computations of a high-order CFD program on NVIDIA GPUs. Two methods are used. First, we use registers to cache the data needed by the stencil computations in the kernel: CUDA warp shuffle functions exchange data between neighboring grid points, and the per-thread computation granularity is adjusted to increase data reuse. Second, we use shared memory to buffer the grid data needed by the stencil computations in the kernel, and apply loop tiling to reduce redundant global memory accesses. Performance is evaluated on an NVIDIA Tesla K80 GPU. Compared with the original implementation, which uses only global memory, the register-based implementation achieves maximum speedups of 2.59 and 2.79 for the 15M and 60M grids, respectively, and the shared-memory implementation achieves maximum speedups of 3.51 and 3.36 for the same two grids.
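The two data-movement patterns named in the abstract can be illustrated with a minimal CUDA sketch. It is not the paper's code. It assumes a 1-D, radius-2 stencil (the fourth-order central difference for a first derivative), one grid point per thread, and hypothetical names (stencil_shuffle, stencil_shared, d1, run_demo). The paper's kernels operate on CFD grids and also adjust thread granularity and tile loops, which this sketch omits.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int R = 2;                        // stencil radius (assumed for illustration)
constexpr int WARP = 32;                    // warp size
constexpr int BLOCK = 128;                  // threads per block, a multiple of WARP
constexpr int OUT_PER_WARP = WARP - 2 * R;  // lanes per warp that produce an output

// Fourth-order central difference: (f[-2] - 8 f[-1] + 8 f[+1] - f[+2]) / (12 h).
__host__ __device__ inline float d1(float m2, float m1, float p1, float p2, float inv12h) {
    return (m2 - 8.0f * m1 + 8.0f * p1 - p2) * inv12h;
}

// Register caching: each lane loads one value, neighbors arrive via warp shuffle.
__global__ void stencil_shuffle(const float* in, float* out, int n, float inv12h) {
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x % WARP;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int g = warp * OUT_PER_WARP + lane - R;          // warps overlap by 2*R halo lanes
    float c = (g >= 0 && g < n) ? in[g] : 0.0f;      // single global load per lane
    float m1 = __shfl_up_sync(full, c, 1);           // value held by lane - 1
    float m2 = __shfl_up_sync(full, c, 2);           // value held by lane - 2
    float p1 = __shfl_down_sync(full, c, 1);         // value held by lane + 1
    float p2 = __shfl_down_sync(full, c, 2);         // value held by lane + 2
    if (lane >= R && lane < WARP - R && g >= R && g < n - R)
        out[g] = d1(m2, m1, p1, p2, inv12h);
}

// Shared-memory buffering: the block stages its tile plus halo once, then reuses it.
// Must be launched with blockDim.x == BLOCK.
__global__ void stencil_shared(const float* in, float* out, int n, float inv12h) {
    __shared__ float tile[BLOCK + 2 * R];
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int base = blockIdx.x * blockDim.x - R;          // global index stored in tile[0]
    for (int t = threadIdx.x; t < BLOCK + 2 * R; t += blockDim.x) {
        int src = base + t;
        tile[t] = (src >= 0 && src < n) ? in[src] : 0.0f;
    }
    __syncthreads();
    int s = threadIdx.x + R;                         // tile[s] holds in[g]
    if (g >= R && g < n - R)
        out[g] = d1(tile[s - 2], tile[s - 1], tile[s + 1], tile[s + 2], inv12h);
}

static float max_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float m = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) m = std::fmax(m, std::fabs(a[i] - b[i]));
    return m;
}

// Runs both kernels and compares them with a CPU reference; true if both agree.
bool run_demo() {
    const int n = 1000;
    const float h = 0.01f, inv12h = 1.0f / (12.0f * h);
    const size_t bytes = n * sizeof(float);
    std::vector<float> x(n), ref(n, 0.0f), a(n), b(n);
    for (int i = 0; i < n; ++i) x[i] = std::sin(i * h);
    for (int i = R; i < n - R; ++i)
        ref[i] = d1(x[i - 2], x[i - 1], x[i + 1], x[i + 2], inv12h);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, x.data(), bytes, cudaMemcpyHostToDevice);

    int warps = (n + OUT_PER_WARP - 1) / OUT_PER_WARP;
    cudaMemset(d_out, 0, bytes);
    stencil_shuffle<<<(warps * WARP + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n, inv12h);
    cudaMemcpy(a.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaMemset(d_out, 0, bytes);
    stencil_shared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n, inv12h);
    cudaMemcpy(b.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    bool ok = cudaGetLastError() == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    std::printf("max |shuffle - ref| = %g, max |shared - ref| = %g\n",
                max_diff(a, ref), max_diff(b, ref));
    return ok && max_diff(a, ref) < 1e-4f && max_diff(b, ref) < 1e-4f;
}

int main() { return run_demo() ? 0 : 1; }
```

In the shuffle version each warp spends 2*R lanes on halo values, so only 28 of its 32 lanes write a result. Computing several points per thread, as the abstract describes, is one way to amortize that overhead. In the shared-memory version the halo is loaded once per block and reused by every thread in it.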

Bibliographic record

Source
International Conference on Parallel and Distributed Computing, Applications, and Technologies | 2020 | pp. 43-56 | 14 pages
Authors
Shengxiang Wang; Zhuoqian Li; Yonggang Che
Original format
PDF
Keywords
Stencil computation; NVIDIA GPU; Warp shuffle; Register caching; Shared memory; Loop tiling

Indexed: 2022-08-26 13:56:47

Similar literature

1. Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs [J]. Thiago Carrijo Nasciutti, Jairo Panetta, Pedro Pais Lopes. Concurrency, Practice and Experience. 2019, no. 18

2. Highly Optimized Code Generation for Stencil Codes with Computation Reuse for GPUs [J]. Wen-Jing Ma, Kan Gao, Guo-Ping Long. Journal of Computer Science and Technology. 2016, no. 6

3. A parallel optimization method for stencil computation on the domain that is bigger than memory capacity of GPUs [C]. Jin Guanghao, Endo Toshio, Matsuoka Satoshi. IEEE International Conference on Cluster Computing. 2013

4. Optimization of Stencil Computations on GPUs [D]. Rawat, Prashant Singh. 2018

5. DOPA: GPU-based protein alignment using database and memory access optimizations [O]. Laiq Hasan, Marijn Kentie, Zaid Al-Ars. 2011

6. 3.5D blocking optimization for stencil computations on modern CPUs and GPUs [O]. Anthony Nguyen, Nadathur Satish, Jatin Chhugani. 2013
