International Journal of Parallel Programming

A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations

Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
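
The ghost zone trade-off described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the paper's framework or its GPU code: the 3-point averaging stencil, the function names (`step`, `run_reference`, `run_tiled`) and the parameters (`tile`, `ghost`) are all assumptions chosen for illustration. Each tile copies `ghost` extra cells on each side, runs `ghost` sweeps privately, and only then writes back its own cells, so the number of exchange rounds drops while the halo cells are recomputed redundantly by neighbouring tiles.

```python
def step(u):
    """One 3-point averaging sweep; the first and last cells are held fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run_reference(u, steps):
    """Untiled baseline: apply `steps` sweeps to the whole array."""
    for _ in range(steps):
        u = step(u)
    return u


def run_tiled(u, steps, tile, ghost):
    """Tiled version with a ghost zone of width `ghost` (hypothetical parameters).

    Returns the final array and the number of exchange rounds, i.e. how many
    times tiles had to publish their results before neighbours could proceed.
    """
    n = len(u)
    done = 0
    rounds = 0
    while done < steps:
        k = min(ghost, steps - done)      # sweeps performed in this round
        new = list(u)
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo = max(start - k, 0)        # halo clipped at the global boundary
            hi = min(end + k, n)
            local = u[lo:hi]              # tile plus ghost cells
            for _ in range(k):
                local = step(local)       # halo cells go stale from the edge inward
            # After k sweeps only the outer k halo cells per side are stale,
            # so the tile's own cells are still exact.
            new[start:end] = local[start - lo:end - lo]
        u = new
        done += k
        rounds += 1
    return u, rounds


if __name__ == "__main__":
    u0 = [float((i * 7) % 11) for i in range(40)]
    ref = run_reference(u0, 10)
    for g in (1, 2, 5):
        out, rounds = run_tiled(u0, 10, tile=8, ghost=g)
        ok = all(abs(a - b) < 1e-12 for a, b in zip(out, ref))
        print(g, rounds, ok)
```

In this toy setting a wider ghost zone means fewer rounds but more redundant updates per round; choosing the width that balances the two is the selection problem the abstract says the performance model automates.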
