IEEE Transactions on Parallel and Distributed Systems

GPU Tensor Cores for Fast Arithmetic Reductions

Abstract

This article proposes a parallel algorithm for computing the arithmetic reduction of n numbers as a set of matrix-multiply-accumulate (MMA) operations executed simultaneously by GPU tensor cores. The analysis, assuming tensors of size m × m, shows that the proposed algorithm has a parallel running time of T(n) = 5 log_{m^2}(n) and a speedup of S = (4/5) log_2(m^2) over a canonical parallel reduction. Experimental performance results on a Tesla V100 GPU show that the tensor-core based approach is energy efficient and runs up to ~3.2× and 2× faster than a standard GPU-based reduction and Nvidia's CUB library, respectively, while keeping the numerical error below 1 percent with respect to a double-precision CPU reduction. The chained design of the algorithm allows a flexible configuration of GPU thread-blocks, and the optimal values found through experimentation agree with the theoretical ones. The results obtained in this work show that GPU tensor cores are relevant not only for Deep Learning or Linear Algebra computations, but also for applications that require the acceleration of large summations.
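
To make the core idea concrete, the following is a minimal CUDA sketch, not the paper's actual chained algorithm, of how a single warp can reduce a 16 × 16 tile of 256 numbers with two tensor-core MMA operations, using the identity onesᵀ · A · ones = Σ A. The kernel name warp_reduce_256, the 16×16×16 WMMA fragment shape (i.e., m = 16), and the round trip through shared memory between the two MMAs are assumptions made purely for illustration.

```cuda
// Minimal illustrative sketch; requires -arch=sm_70 or newer (Volta tensor cores).
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp reduces 256 half-precision values (a 16x16 tile).
__global__ void warp_reduce_256(const half *in, float *out) {
    __shared__ half  ones[256];   // 16x16 all-ones matrix
    __shared__ float acc[256];    // staging buffer for the float accumulator
    __shared__ half  acc_h[256];  // accumulator converted back to half for the 2nd MMA

    for (int i = threadIdx.x; i < 256; i += 32)
        ones[i] = __float2half(1.0f);
    __syncthreads();

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Step 1: C = ones * A. Every row of C now holds the 16 column sums of A.
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, ones, 16);
    wmma::load_matrix_sync(b_frag, in, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Round-trip through shared memory so the accumulator can be fed back as an operand.
    wmma::store_matrix_sync(acc, c_frag, 16, wmma::mem_row_major);
    __syncthreads();
    for (int i = threadIdx.x; i < 256; i += 32)
        acc_h[i] = __float2half(acc[i]);
    __syncthreads();

    // Step 2: C = (ones * A) * ones. Every entry of C now equals the total sum.
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, acc_h, 16);
    wmma::load_matrix_sync(b_frag, ones, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(acc, c_frag, 16, wmma::mem_row_major);
    __syncthreads();
    if (threadIdx.x == 0)
        *out = acc[0];  // any entry works; all 256 hold the same sum
}
```

A launch such as warp_reduce_256<<<1, 32>>>(d_in, d_out) on 256 device-resident half values would write their sum to d_out. The paper's algorithm instead chains many such MMA steps across thread-blocks, reducing the input by a factor of m² per level, consistent with the log_{m^2}(n) term in the stated running time.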