IEEE International Parallel and Distributed Processing Symposium

Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs



Abstract

Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batched linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that, despite these restrictions, gives the developer a great deal of control. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
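The low-level vendor API mentioned in the abstract is presumably the CUDA WMMA interface (nvcuda::wmma in mma.h), which exposes the Tensor Cores only through a few discrete fragment shapes (e.g., 16x16x16 for FP16). The sketch below is not the paper's kernel; it only illustrates how one warp can multiply a single 16x16 FP16 problem from a batch with that interface. The kernel name and the device pointer arrays d_A, d_B, and d_C are hypothetical, and the code assumes 16x16 row-major A and C, column-major B, and compilation for sm_70 or newer.

    #include <cuda_fp16.h>
    #include <mma.h>

    using namespace nvcuda;

    // One thread block (a single warp of 32 threads) handles one problem in the batch.
    __global__ void batched_hgemm_16(const half* const* d_A,
                                     const half* const* d_B,
                                     half* const*       d_C)
    {
        int batch_id = blockIdx.x;

        // WMMA exposes Tensor Cores only through discrete fragment shapes;
        // 16x16x16 is used here, so smaller matrices must be padded to 16x16.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, half>               c_frag;

        wmma::fill_fragment(c_frag, __float2half(0.0f));

        // Load the 16x16 operands (leading dimension 16), multiply-accumulate
        // on the Tensor Cores, and write the result tile back.
        wmma::load_matrix_sync(a_frag, d_A[batch_id], 16);
        wmma::load_matrix_sync(b_frag, d_B[batch_id], 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(d_C[batch_id], c_frag, 16, wmma::mem_row_major);
    }

A launch such as batched_hgemm_16<<<batch_count, 32>>>(dA, dB, dC) would run one single-warp block per problem. The paper's actual design goes further, packing multiple small problems per thread block and handling sizes that do not fill a Tensor Core fragment, which this sketch does not attempt.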
