Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs

Abstract

The performance of a CUDA kernel often depends on the number of threads per thread block (the thread-block size), and the optimal configuration differs with the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and must run on processors with different architectures or numbers of cores. A method is therefore needed to adjust the thread-block size automatically, in a manner that is as economical and as theoretically grounded as possible, according to the circumstances of each routine call. In this study, we propose a method for adjusting the thread-block size of several memory-bound BLAS kernels on NVIDIA GPUs. Our method is a model-driven approach that automatically determines the thread-block size before every kernel launch, based on three occupancy models at the warp, thread-block, and grid levels. We demonstrate that our method determines a nearly optimal thread-block size for several kernels on Kepler and Maxwell architecture GPUs.
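
A minimal CUDA sketch of the general idea, not the authors' occupancy models: it picks a thread-block size at launch time with the CUDA runtime occupancy calculator (cudaOccupancyMaxPotentialBlockSize) and sizes the grid to the problem, using a memory-bound BLAS level-1 kernel (SAXPY) as the example. The kernel and launcher names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Memory-bound BLAS level-1 kernel: y = alpha * x + y (SAXPY).
__global__ void saxpy(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

// Choose the thread-block size just before each launch instead of
// hard-coding it, so the choice adapts to the GPU in use and to n.
void launch_saxpy(int n, float alpha, const float *x, float *y)
{
    int min_grid_size = 0;  // grid size needed for maximum occupancy
    int block_size    = 0;  // suggested threads per thread-block
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       saxpy, 0 /* dynamic smem */, 0);

    int grid_size = (n + block_size - 1) / block_size;  // cover all n elements
    saxpy<<<grid_size, block_size>>>(n, alpha, x, y);
}
```

The method described in the paper goes further than this sketch: it evaluates occupancy at the warp, thread-block, and grid levels and selects the block size from those models before every launch, rather than relying on the runtime's single occupancy heuristic.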