ACM Transactions on Mathematical Software

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Abstract

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
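The "single API call" batching pattern the abstract refers to can be illustrated with an existing vendor routine such as cuBLAS's cublasStrsmBatched. The sketch below is illustrative only: the matrix size, batch count, and memory layout are assumptions, and it shows the vendor-provided interface rather than the paper's own kernels. Error checking and matrix initialization are omitted for brevity.

```c
/* Minimal sketch of the single-call batched pattern, using cuBLAS's
 * cublasStrsmBatched. Sizes, batch count, and layout are illustrative
 * assumptions, not the paper's kernels or tuning. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int n = 32;        /* "very small" matrix size, within the paper's range */
    const int batch = 1024;  /* thousands of independent problems, one API call */
    const float alpha = 1.0f;

    /* One contiguous slab per operand; problem i lives at offset i*n*n.
     * (Matrices are left uninitialized in this sketch.) */
    float *dA, *dB;
    cudaMalloc((void **)&dA, sizeof(float) * (size_t)n * n * batch);
    cudaMalloc((void **)&dB, sizeof(float) * (size_t)n * n * batch);

    /* Batched BLAS routines take arrays of per-problem device pointers. */
    float **hA = malloc(batch * sizeof *hA);
    float **hB = malloc(batch * sizeof *hB);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
    }
    float **dAarray, **dBarray;
    cudaMalloc((void **)&dAarray, batch * sizeof *dAarray);
    cudaMalloc((void **)&dBarray, batch * sizeof *dBarray);
    cudaMemcpy(dAarray, hA, batch * sizeof *hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dBarray, hB, batch * sizeof *hB, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Solve all `batch` lower-triangular systems A_i * X_i = alpha * B_i
     * in one call, amortizing launch overhead across the whole batch. */
    cublasStrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha,
                       (const float *const *)dAarray, n,
                       (float *const *)dBarray, n, batch);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dAarray); cudaFree(dBarray);
    free(hA); free(hB);
    return 0;
}
```

The abstract's point is that such vendor batched routines cover only a subset of the needed operations and perform poorly at very small sizes; the paper's recursive, register-resident kernels target exactly that gap while exposing the same one-call batched interface.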