A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Jiayuan Meng; Kevin Skadron

Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
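
The trade-off described above can be illustrated with a minimal sketch. It is not the authors' framework or performance model. It assumes a hypothetical 1D three-point averaging stencil with fixed boundary cells, and the names `step`, `run_global`, `run_tiled`, `tile`, and `ghost` are invented for illustration. Each tile copies a ghost zone of `ghost` extra cells on each side, runs `ghost` sweeps with no communication, and keeps only its own interior, which is still exact because stale values creep inward by one cell per sweep.

```python
def step(values):
    """One sweep of a 3-point averaging stencil; the two end cells stay fixed."""
    out = values[:]
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def run_global(grid, iterations):
    """Reference version: sweep the whole grid, one iteration at a time."""
    for _ in range(iterations):
        grid = step(grid)
    return grid


def run_tiled(grid, iterations, tile, ghost):
    """Tiled version: halos are exchanged only once every `ghost` sweeps.

    Assumption for this sketch: each tile stands in for one processing
    element, and copying from `grid` stands in for the halo exchange.
    """
    n = len(grid)
    done = 0
    while done < iterations:
        k = min(ghost, iterations - done)   # sweeps in this round
        new = grid[:]
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo = max(start - k, 0)          # ghost zone, clipped at the grid edge
            hi = min(end + k, n)
            local = grid[lo:hi]             # the tile plus its ghost zone
            for _ in range(k):              # no communication inside the round
                local = step(local)
            # Ghost cells are now stale; only the tile's own cells are kept.
            new[start:end] = local[start - lo:end - lo]
        grid = new                          # acts as the global synchronization
        done += k
    return grid


if __name__ == "__main__":
    data = [float((i * 7) % 11) for i in range(32)]
    reference = run_global(data, 10)
    tiled = run_tiled(data, 10, tile=8, ghost=3)
    print(max(abs(a - b) for a, b in zip(reference, tiled)))
```

In this sketch, a larger `ghost` means fewer exchange rounds but more cells recomputed redundantly by neighbouring tiles, which is the balance the abstract says the performance model is meant to strike.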

Bibliographic details

Source: International Journal of Parallel Programming, 2011, Issue 1, pp. 115-142 (28 pages)
Authors: Jiayuan Meng; Kevin Skadron
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English

Similar literature

1. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations [J]. Jiayuan Meng, Kevin Skadron. International Journal of Parallel Programming. 2011, Issue 1.
2. A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study [J]. Tabik S., Peemen M., Romero L. F. Journal of Supercomputing. 2018, Issue 4.
3. Stencil-aware GPU optimization of iterative solvers (Conference Paper) [J]. Lowell D., Godwin J., Holewinski J., SIAM Journal on Scientific Computing. 2013, Issue 5.
4. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs [C]. Jiayuan Meng, Kevin Skadron. International Conference on Supercomputing. 2009.
5. Optimization of Stencil Computations on GPUs [D]. Rawat, Prashant Singh. 2018.
6. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs [O]. Xiaohua Wan, Fa Zhang, Qi Chu, 2012.
7. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations [O]. Jiayuan Meng, Kevin Skadron. 2012.
8. Block-Iterative Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs [R]. Rodriguez, M., Philip, B., Wang, Z., 2014.
