IEEE Transactions on Parallel and Distributed Systems

GPU Tensor Cores for Fast Arithmetic Reductions

Abstract

This article proposes a parallel algorithm for computing the arithmetic reduction of n numbers as a set of matrix-multiply-accumulate (MMA) operations executed simultaneously by GPU tensor cores. The analysis, assuming tensors of size m × m, shows that the proposed algorithm has a parallel running time of T(n) = 5 log_{m^2}(n) and a speedup of S = (4/5) log_2(m^2) over a canonical parallel reduction. Experimental performance results on a Tesla V100 GPU show that the tensor-core based approach is energy efficient and runs up to ~3.2× and 2× faster than a standard GPU-based reduction and Nvidia's CUB library, respectively, while keeping the numerical error below 1 percent with respect to a double-precision CPU reduction. The chained design of the algorithm allows a flexible configuration of GPU thread-blocks, and the optimal values found through experimentation agree with the theoretical ones. The results obtained in this work show that GPU tensor cores are relevant not only for Deep Learning or Linear Algebra computations, but also for applications that require the acceleration of large summations.
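
To make the core idea concrete, the following is a minimal CUDA sketch, not the paper's actual chained algorithm, of how a single warp can reduce a 16 × 16 tile of 256 numbers with two tensor-core MMA operations, using the identity onesᵀ · A · ones = Σ A. The kernel name warp_reduce_256, the 16×16×16 WMMA fragment shape (i.e., m = 16), and the round trip through shared memory between the two MMAs are assumptions made purely for illustration.

```cuda
// Minimal illustrative sketch; requires -arch=sm_70 or newer (Volta tensor cores).
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp reduces 256 half-precision values (a 16x16 tile).
__global__ void warp_reduce_256(const half *in, float *out) {
    __shared__ half  ones[256];   // 16x16 all-ones matrix
    __shared__ float acc[256];    // staging buffer for the float accumulator
    __shared__ half  acc_h[256];  // accumulator converted back to half for the 2nd MMA

    for (int i = threadIdx.x; i < 256; i += 32)
        ones[i] = __float2half(1.0f);
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Step 1: C = ones * A. Every row of C now holds the 16 column sums of A.
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, ones, 16);
    wmma::load_matrix_sync(b_frag, in, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Round-trip through shared memory so the accumulator can be fed back as an operand.
    wmma::store_matrix_sync(acc, c_frag, 16, wmma::mem_row_major);
    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += 32)
        acc_h[i] = __float2half(acc[i]);
    __syncthreads();

    // Step 2: C = (ones * A) * ones. Every entry of C now equals the total sum.
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, acc_h, 16);
    wmma::load_matrix_sync(b_frag, ones, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(acc, c_frag, 16, wmma::mem_row_major);
    __syncthreads();
    if (threadIdx.x == 0)
        *out = acc[0];  // any entry works; all 256 hold the same sum
}
```

A launch such as warp_reduce_256<<<1, 32>>>(d_in, d_out) on 256 device-resident half values would write their sum to d_out. The paper's algorithm instead chains many such MMA steps across thread-blocks, reducing the input by a factor of m² per level, consistent with the log_{m^2}(n) term in the stated running time.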