Conference: International Symposium on Embedded Multicore/Many-core Systems-on-Chip

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs



Abstract

The performance of a CUDA kernel often depends on the number of threads per thread-block (the thread-block size), and the optimal configuration differs according to the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and run on various processors with different architectures or numbers of cores. Therefore, a method is needed that automatically adjusts the thread-block size, as economically and as theoretically grounded as possible, depending on the circumstances of each routine call. In this study, we propose a method to adjust the thread-block size for several memory-bound BLAS kernels on NVIDIA GPUs. Our method is a model-driven approach that automatically determines the thread-block size before every launch of the kernel, on the basis of three occupancy models at the warp, thread-block, and grid levels. We demonstrate that our method determines a nearly optimal thread-block size for several kernels on Kepler and Maxwell architecture GPUs.
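
The occupancy models themselves are not given in the abstract, but the following CUDA sketch illustrates the kind of decision the method automates: choosing a thread-block size immediately before each kernel launch and sizing the grid from it. The memory-bound kernel scal_kernel, the helper launch_scal, and the use of the runtime's cudaOccupancyMaxPotentialBlockSize heuristic are all illustrative assumptions; the paper derives the block size from its own warp-, thread-block-, and grid-level models rather than from this API call.

```cuda
// Minimal sketch, not the paper's method: pick a thread-block size per launch.
#include <cuda_runtime.h>
#include <cstdio>

// A memory-bound BLAS-like kernel (SCAL: x = alpha * x).
__global__ void scal_kernel(int n, float alpha, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= alpha;
}

// Choose a thread-block size right before the launch, then size the grid to cover n.
void launch_scal(int n, float alpha, float *d_x) {
    int min_grid = 0, block = 0;
    // Occupancy-based runtime heuristic for the current GPU; the paper instead
    // evaluates its three analytical occupancy models at this point.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scal_kernel, 0, 0);
    int grid = (n + block - 1) / block;
    scal_kernel<<<grid, block>>>(n, alpha, d_x);
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    launch_scal(n, 2.0f, d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```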
