Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs

Abstract

The performance of a CUDA kernel often depends on the number of threads per thread block (the thread-block size), and the optimal configuration differs with the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as the Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and must run on processors with different architectures or numbers of cores. A method is therefore needed to adjust the thread-block size automatically, in a manner that is as economical and as theoretically grounded as possible, according to the circumstances of each routine call. In this study, we propose a method for adjusting the thread-block size of several memory-bound BLAS kernels on NVIDIA GPUs. Our method is a model-driven approach that automatically determines the thread-block size before every kernel launch, based on three occupancy models at the warp, thread-block, and grid levels. We demonstrate that our method determines a nearly optimal thread-block size for several kernels on Kepler and Maxwell architecture GPUs.
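
A minimal CUDA sketch of the general idea, not the authors' occupancy models: it picks a thread-block size at launch time with the CUDA runtime occupancy calculator (cudaOccupancyMaxPotentialBlockSize) and sizes the grid to the problem, using a memory-bound BLAS level-1 kernel (SAXPY) as the example. The kernel and launcher names are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Memory-bound BLAS level-1 kernel: y = alpha * x + y (SAXPY).
__global__ void saxpy(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

// Choose the thread-block size just before each launch instead of
// hard-coding it, so the choice adapts to the GPU in use and to n.
void launch_saxpy(int n, float alpha, const float *x, float *y)
{
    int min_grid_size = 0;  // grid size needed for maximum occupancy
    int block_size    = 0;  // suggested threads per thread-block
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       saxpy, 0 /* dynamic smem */, 0);

    int grid_size = (n + block_size - 1) / block_size;  // cover all n elements
    saxpy<<<grid_size, block_size>>>(n, alpha, x, y);
}
```

The method described in the paper goes further than this sketch: it evaluates occupancy at the warp, thread-block, and grid levels and selects the block size from those models before every launch, rather than relying on the runtime's single occupancy heuristic.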