Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs


Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 98% of the optimal speedup.
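
The trade-off described above can be illustrated with a minimal CPU-side sketch. It is not the paper's framework or its performance model: the 1D three-point averaging stencil, the fixed boundary cells, and the names `step`, `run_reference`, and `run_tiled` are all assumptions chosen for illustration. Each tile copies `ghost` extra cells per side, runs `ghost` sweeps without any exchange, and writes back only its own cells, so the result matches a sweep-by-sweep reference while exchanging halos less often.

```python
def step(cells):
    """One Jacobi sweep of a 3-point average; the two end cells stay unchanged."""
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out


def run_reference(grid, iterations):
    """Untiled baseline: one global sweep per iteration."""
    for _ in range(iterations):
        grid = step(grid)
    return grid


def run_tiled(grid, iterations, tile, ghost):
    """Tiled version with a ghost zone of `ghost` cells per side (hypothetical helper).

    Every round, each tile loads its cells plus the ghost zone, performs
    `ghost` local sweeps, and stores only its own cells. Stale values creep
    inward one cell per sweep from the local edges, so after `ghost` sweeps
    they are still confined to the ghost zone.
    """
    n = len(grid)
    assert ghost >= 1 and iterations % ghost == 0
    for _ in range(iterations // ghost):      # one halo exchange per round
        nxt = list(grid)
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo, hi = max(0, start - ghost), min(n, end + ghost)
            local = grid[lo:hi]               # tile plus ghost cells
            for _ in range(ghost):
                local = step(local)           # includes redundant ghost-cell work
            nxt[start:end] = local[start - lo:end - lo]
        grid = nxt
    return grid


if __name__ == "__main__":
    grid = [float(i % 7) for i in range(64)]
    ref = run_reference(grid, 12)
    tiled = run_tiled(grid, 12, tile=16, ghost=4)
    print(max(abs(a - b) for a, b in zip(ref, tiled)))
```

In this sketch, `ghost=4` needs 3 exchange rounds for 12 iterations instead of 12, at the cost of recomputing the overlapped cells in neighboring tiles; picking the value that balances those two costs is the selection problem the abstract refers to.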


Bibliographic record

Source
International Conference on Supercomputing | 2009 | 10 pages
Authors
Jiayuan Meng; Kevin Skadron
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords
ghost zone; GPU; parallel computing; stencil computation


Similar literature


1. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations [J]. Jiayuan Meng, Kevin Skadron. International Journal of Parallel Programming. 2011, No. 1
2. Practical applicability of optimizations and performance models to complex stencil-based loop kernels in CFD [J]. Wichmann Karl-Robert, Kronbichler Martin, Loehner Rainald. Experimental Mechanics. 2019, No. 4
3. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs [C]. Jiayuan Meng, Kevin Skadron. International Conference on Supercomputing. 2009
4. Optimization of Stencil Computations on GPUs [D]. Rawat, Prashant Singh. 2018
5. High-performance blob-based iterative three-dimensional reconstruction in electron tomography using multi-GPUs [O]. Xiaohua Wan, Fa Zhang, Qi Chu. 2012
6. Performance Modeling and Automatic Ghost Zone Optimization for Iterative Stencil Loops on GPUs [O]. Jiayuan Meng, Kevin Skadron. 2012
7. Block-Iterative Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs [R]. Rodriguez, M., Philip, B., Wang, Z. 2014
