Published in: International Conference on Supercomputing

Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs

Abstract

Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 98% of the optimal speedup.
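
The trade-off described in the abstract can be illustrated with a small CPU-side sketch. It is not the paper's framework or performance model, and all names (`step`, `run_reference`, `run_tiled`, `tile`, `ghost`) are hypothetical. It assumes a 1D three-point averaging stencil with fixed end cells. Each tile copies `ghost` extra cells per side, runs up to `ghost` iterations locally, and only then takes part in the next exchange round. On a GPU, a tile would roughly correspond to a thread block and a round to a global synchronization, but that mapping is an assumption here.

```python
def step(cells):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out


def run_reference(grid, iterations):
    """Untiled baseline: sweep the whole grid once per iteration."""
    for _ in range(iterations):
        grid = step(grid)
    return grid


def run_tiled(grid, iterations, tile, ghost):
    """Tiled sweep with ghost zones (illustrative sketch only).

    Returns (final grid, exchange rounds, total cell updates).
    """
    n = len(grid)
    done = rounds = updates = 0
    while done < iterations:
        k = min(ghost, iterations - done)  # iterations fused in this round
        new = list(grid)
        for start in range(0, n, tile):
            end = min(start + tile, n)
            lo = max(start - k, 0)         # ghost zone, clamped to the grid
            hi = min(end + k, n)
            local = grid[lo:hi]            # tile plus ghost cells
            for _ in range(k):
                local = step(local)        # stale values creep inward by one
                updates += max(len(local) - 2, 0)  # cell per iteration
            # After k sweeps only the ghost cells are stale; keep the tile.
            new[start:end] = local[start - lo:end - lo]
        grid = new                         # stands in for the halo exchange
        done += k
        rounds += 1
    return grid, rounds, updates


if __name__ == "__main__":
    initial = [float(i % 7) for i in range(64)]
    for g in (1, 2, 4):
        _, rounds, updates = run_tiled(initial, 10, 16, g)
        print(f"ghost={g}: rounds={rounds}, cell updates={updates}")
```

A wider ghost zone reduces the number of exchange rounds but increases the number of cell updates, since neighbouring tiles recompute overlapping cells. Choosing the width that balances the two costs is the selection problem the paper's performance model is meant to automate.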
