International Symposium on Computing and Networking Workshops

An Efficient Skinny Matrix-Matrix Multiplication Method by Folding Input Matrices into Tensor Core Operations



Abstract

A specialized unit in NVIDIA's GPUs, called Tensor Core, has kept attracting attention over the last couple of years due to its high computing capability for general matrix-matrix multiplications (GEMMs). A Tensor Core unit calculates a matrix multiply-accumulate (MMA) operation of a fixed size. However, if the input matrices are skinnier than the operands of a Tensor Core operation, part of the operation's computation is wasted. Thus, this paper presents a method for optimizing skinny matrix-matrix multiplication that exploits the full potential of the Tensor Core units. The proposed method feeds multiple segments of an input matrix into a single Tensor Core operation so that more of its computation is utilized. The experimental results show that the proposed method achieves up to a 2.7× speedup compared with the cuBLAS 11.0 library.
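The abstract describes the folding idea only at a high level. As an illustration of the general principle (not the paper's actual packing scheme, which is not given here), the following NumPy sketch packs two independent small products block-diagonally into one full-size, MMA-shaped tile multiply, so a single tile-sized multiplication yields both results instead of wasting half of its arithmetic on zero padding. All tile and segment sizes are assumptions chosen to mirror a 16×16×16 MMA with 8-wide operands.

```python
import numpy as np

# Hedged sketch: fold two independent small matrix products into one
# tile-sized multiply. A Tensor Core MMA computes D = A_tile @ B_tile
# for a fixed tile size (e.g., m = n = k = 16). If the real operands
# are only 8 wide, a naive mapping pads with zeros and wastes half of
# the tile's arithmetic on each MMA.

TILE = 16          # assumed MMA tile dimension
SEG = TILE // 2    # assumed width of each skinny segment (8 here)

rng = np.random.default_rng(0)
A1 = rng.standard_normal((SEG, SEG)).astype(np.float32)
A2 = rng.standard_normal((SEG, SEG)).astype(np.float32)
B = rng.standard_normal((SEG, SEG)).astype(np.float32)

# Folded operands: A segments placed block-diagonally, B replicated
# along the k dimension so each A segment meets its own copy of B.
A_tile = np.zeros((TILE, TILE), dtype=np.float32)
A_tile[:SEG, :SEG] = A1            # segment 1 in the top-left block
A_tile[SEG:, SEG:] = A2            # segment 2 in the bottom-right block

B_tile = np.zeros((TILE, TILE), dtype=np.float32)
B_tile[:SEG, :SEG] = B
B_tile[SEG:, :SEG] = B

D = A_tile @ B_tile                # one tile-sized multiply (one "MMA")

# Both small products come out of the single tile result:
# rows 0..7 hold A1 @ B, rows 8..15 hold A2 @ B.
assert np.allclose(D[:SEG, :SEG], A1 @ B, atol=1e-5)
assert np.allclose(D[SEG:, :SEG], A2 @ B, atol=1e-5)
```

In this toy layout, one tile multiply replaces the two padded MMAs a naive mapping would need, which is the kind of utilization gain the abstract attributes to folding.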
