Conference: International Symposium on Embedded Multicore/Many-core Systems-on-Chip

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs



Abstract

The performance of a CUDA kernel often depends on the number of threads per thread-block (the thread-block size), and the optimal configuration differs according to the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and run on various processors with different architectures or numbers of cores. Therefore, a method is needed that automatically adjusts the thread-block size, as economically and as theoretically grounded as possible, depending on the circumstances of each routine call. In this study, we propose a method to adjust the thread-block size for several memory-bound BLAS kernels on NVIDIA GPUs. Our method is a model-driven approach that automatically determines the thread-block size before every launch of the kernel, on the basis of three occupancy models at the warp, thread-block, and grid levels. We demonstrate that our method determines a nearly optimal thread-block size for several kernels on Kepler and Maxwell architecture GPUs.
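
The occupancy models themselves are not given in the abstract, but the following CUDA sketch illustrates the kind of decision the method automates: choosing a thread-block size immediately before each kernel launch and sizing the grid from it. The memory-bound kernel scal_kernel, the helper launch_scal, and the use of the runtime's cudaOccupancyMaxPotentialBlockSize heuristic are all illustrative assumptions; the paper derives the block size from its own warp-, thread-block-, and grid-level models rather than from this API call.

```cuda
// Minimal sketch, not the paper's method: pick a thread-block size per launch.
#include <cuda_runtime.h>
#include <cstdio>

// A memory-bound BLAS-like kernel (SCAL: x = alpha * x).
__global__ void scal_kernel(int n, float alpha, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

// Choose a thread-block size right before the launch, then size the grid to cover n.
void launch_scal(int n, float alpha, float *d_x) {
    int min_grid = 0, block = 0;
    // Occupancy-based runtime heuristic for the current GPU; the paper instead
    // evaluates its three analytical occupancy models at this point.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scal_kernel, 0, 0);
    int grid = (n + block - 1) / block;
    scal_kernel<<<grid, block>>>(n, alpha, d_x);
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    launch_scal(n, 2.0f, d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```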
