ACM Transactions on Mathematical Software

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs



Abstract

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
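The recursive formulation named in the abstract is the classic splitting of a triangular solve into two half-size solves separated by a matrix-multiply update. Below is a minimal host-side sketch in C, assuming a CBLAS installation; the function name trsm_rec and the base-case threshold nb are illustrative choices, not the paper's API, and the paper's actual kernels fuse these steps into register-resident GPU code applied across the whole batch rather than calling BLAS per matrix.

    #include <cblas.h>

    /* Hypothetical sketch: recursively solve L * X = B, with L an
     * m-by-m lower triangular matrix and B an m-by-n right-hand side,
     * both column-major. The triangle is split into quadrants: solve
     * the top block, update the bottom block with a GEMM, then solve
     * the bottom block. */
    static void trsm_rec(int m, int n, const double *L, int ldl,
                         double *B, int ldb, int nb)
    {
        if (m <= nb) {                     /* base case: solve directly */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasNonUnit,
                        m, n, 1.0, L, ldl, B, ldb);
            return;
        }
        int m1 = m / 2, m2 = m - m1;

        /* X1 = inv(L11) * B1: recurse on the top-left triangle */
        trsm_rec(m1, n, L, ldl, B, ldb, nb);

        /* B2 := B2 - L21 * X1: GEMM update of the bottom block */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m2, n, m1, -1.0, L + m1, ldl, B, ldb,
                    1.0, B + m1, ldb);

        /* X2 = inv(L22) * B2: recurse on the bottom-right triangle */
        trsm_rec(m2, n, L + m1 + (size_t)m1 * ldl, ldl, B + m1, ldb, nb);
    }

In the batched setting the same recursion is applied to every matrix in the batch at once. The single-API-call shape the abstract describes is what vendor libraries expose for the operations they do cover, e.g. cuBLAS's cublasDtrsmBatched, which takes device arrays of per-matrix pointers plus a batchCount so that thousands of small solves are launched through one call.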
