Source: Compiler Construction (conference proceedings)

Automatic Restructuring of GPU Kernels for Exploiting Inter-thread Data Locality

Abstract

Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. For many applications, however, achieving a high fraction of peak performance on current GPUs still requires significant programmer effort. A key consideration in optimizing GPU code is determining a suitable amount of work for each thread to perform. Thread granularity not only has a direct impact on occupancy but can also influence data locality at the register and shared-memory levels. This paper describes a software framework that analyzes dependencies in parallel GPU threads and performs source-level restructuring to obtain GPU kernels with varying thread granularity. The framework supports specification of coarsening factors through source-code annotation and also implements a heuristic, based on estimated register pressure, that automatically recommends coarsening factors for improved memory performance. We present preliminary experimental results on a select set of CUDA kernels. The results show that the proposed strategy is generally able to select profitable coarsening factors. More importantly, the results demonstrate a clear need for automatic control of thread granularity at the software level to achieve higher performance.
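The effect of thread granularity on data locality can be illustrated with a small model. The sketch below is not the paper's framework and is not CUDA. It is a plain Python emulation in which each loop iteration stands in for one GPU thread and local variables stand in for registers. The 3-point stencil and the names `stencil_fine`, `stencil_coarsened`, and `factor` are assumptions chosen for illustration. It shows how merging neighbouring threads turns inter-thread reuse of input values into reuse within a single thread.

```python
def stencil_fine(src):
    """One logical thread per output element; each thread loads 3 inputs."""
    n = len(src)
    out = [0] * n
    loads = 0
    for tid in range(1, n - 1):  # each iteration models one GPU thread
        a, b, c = src[tid - 1], src[tid], src[tid + 1]
        loads += 3
        out[tid] = a + b + c
    return out, loads


def stencil_coarsened(src, factor):
    """Each logical thread computes `factor` consecutive outputs.

    Neighbouring outputs share inputs, so a thread keeps a sliding window
    in local variables (standing in for registers) and loads only one new
    value per additional output.
    """
    n = len(src)
    out = [0] * n
    loads = 0
    interior = n - 2  # number of outputs that have both neighbours
    num_threads = (interior + factor - 1) // factor  # ceiling division
    for tid in range(num_threads):  # each iteration models one coarsened thread
        first = 1 + tid * factor
        last = min(first + factor, n - 1)  # exclusive upper bound
        a, b = src[first - 1], src[first]
        loads += 2
        for i in range(first, last):
            c = src[i + 1]
            loads += 1
            out[i] = a + b + c
            a, b = b, c  # slide the window; no reload of shared inputs
    return out, loads


data = list(range(18))
ref, fine_loads = stencil_fine(data)
res, coarse_loads = stencil_coarsened(data, 4)
assert res == ref
print(fine_loads, coarse_loads)  # prints "48 24"
```

In this model a coarsening factor of 4 halves the number of loads while producing the same output. On a real GPU, coarsening also reduces the number of threads and usually raises per-thread register use, which is presumably why the abstract links the choice of factor to occupancy and estimated register pressure.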