ACM Transactions on Mathematical Software

Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Abstract

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
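The "single API call" batching pattern the abstract refers to can be illustrated with an existing vendor routine such as cuBLAS's cublasStrsmBatched. The sketch below is illustrative only: the matrix size, batch count, and memory layout are assumptions, and it shows the vendor-provided interface rather than the paper's own kernels. Error checking and matrix initialization are omitted for brevity.

```c
/* Minimal sketch of the single-call batched pattern, using cuBLAS's
 * cublasStrsmBatched. Sizes, batch count, and layout are illustrative
 * assumptions, not the paper's kernels or tuning. */
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int n = 32;        /* "very small" matrix size, within the paper's range */
    const int batch = 1024;  /* thousands of independent problems, one API call */
    const float alpha = 1.0f;

    /* One contiguous slab per operand; problem i lives at offset i*n*n.
     * (Matrices are left uninitialized in this sketch.) */
    float *dA, *dB;
    cudaMalloc((void **)&dA, sizeof(float) * (size_t)n * n * batch);
    cudaMalloc((void **)&dB, sizeof(float) * (size_t)n * n * batch);

    /* Batched BLAS routines take arrays of per-problem device pointers. */
    float **hA = malloc(batch * sizeof *hA);
    float **hB = malloc(batch * sizeof *hB);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
    }
    float **dAarray, **dBarray;
    cudaMalloc((void **)&dAarray, batch * sizeof *dAarray);
    cudaMalloc((void **)&dBarray, batch * sizeof *dBarray);
    cudaMemcpy(dAarray, hA, batch * sizeof *hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dBarray, hB, batch * sizeof *hB, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Solve all `batch` lower-triangular systems A_i * X_i = alpha * B_i
     * in one call, amortizing launch overhead across the whole batch. */
    cublasStrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha,
                       (const float *const *)dAarray, n,
                       (float *const *)dBarray, n, batch);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dAarray); cudaFree(dBarray);
    free(hA); free(hB);
    return 0;
}
```

The abstract's point is that such vendor batched routines cover only a subset of the needed operations and perform poorly at very small sizes; the paper's recursive, register-resident kernels target exactly that gap while exposing the same one-call batched interface.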