ACM Transactions on Mathematical Software

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs



Abstract

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
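The recursive formulation named in the abstract is the classic splitting of a triangular solve into two half-size solves separated by a matrix-multiply update. Below is a minimal host-side sketch in C, assuming a CBLAS installation; the function name trsm_rec and the base-case threshold nb are illustrative choices, not the paper's API, and the paper's actual kernels fuse these steps into register-resident GPU code applied across the whole batch rather than calling BLAS per matrix.

    #include <cblas.h>

    /* Hypothetical sketch: recursively solve L * X = B, with L an
     * m-by-m lower triangular matrix and B an m-by-n right-hand side,
     * both column-major. The triangle is split into quadrants: solve
     * the top block, update the bottom block with a GEMM, then solve
     * the bottom block. */
    static void trsm_rec(int m, int n, const double *L, int ldl,
                         double *B, int ldb, int nb)
    {
        if (m <= nb) {                     /* base case: solve directly */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                        CblasNoTrans, CblasNonUnit,
                        m, n, 1.0, L, ldl, B, ldb);
            return;
        }
        int m1 = m / 2, m2 = m - m1;

        /* X1 = inv(L11) * B1: recurse on the top-left triangle */
        trsm_rec(m1, n, L, ldl, B, ldb, nb);

        /* B2 := B2 - L21 * X1: GEMM update of the bottom block */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m2, n, m1, -1.0, L + m1, ldl, B, ldb,
                    1.0, B + m1, ldb);

        /* X2 = inv(L22) * B2: recurse on the bottom-right triangle */
        trsm_rec(m2, n, L + m1 + (size_t)m1 * ldl, ldl, B + m1, ldb, nb);
    }

In the batched setting the same recursion is applied to every matrix in the batch at once. The single-API-call shape the abstract describes is what vendor libraries expose for the operations they do cover, e.g. cuBLAS's cublasDtrsmBatched, which takes device arrays of per-matrix pointers plus a batchCount so that thousands of small solves are launched through one call.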
