IEEE International Parallel and Distributed Processing Symposium

Fast Batched Matrix Multiplication for Small Sizes Using Half-Precision Arithmetic on GPUs



Abstract

Matrix multiplication (GEMM) is the most important operation in dense linear algebra. Because it is a compute-bound operation that is rich in data reuse, many applications from different scientific domains cast their most performance-critical stages to use GEMM. With the rise of batched linear algebra, batched GEMM operations have become increasingly popular in domains other than dense linear solvers, such as tensor contractions, sparse direct solvers, and machine learning. For the latter in particular, batched GEMM in reduced precision (i.e., FP16) has been the core operation of many deep learning frameworks. This paper introduces an optimized batched GEMM for FP16 arithmetic (HGEMM) on graphics processing units (GPUs). We provide a detailed design strategy that takes advantage of the Tensor Core technology recently introduced in CUDA-enabled GPUs. The developed solution uses low-level APIs provided by the vendor in an optimized design that overcomes the limitations imposed by the hardware (in the form of discrete configurations). The outcome is a highly flexible GPU kernel that, despite these restrictions, gives the developer a great deal of control. The paper also pays particular attention to multiplications of very small matrices that cannot fully occupy the Tensor Core units. Our results show that the proposed design can outperform the highly optimized vendor routine for sizes up to 100 by factors between 1.2x and 10x using a Tesla V100 GPU. For extremely small matrices, the observed speedups range between 1.8x and 26x.
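The low-level vendor API mentioned in the abstract is presumably the CUDA WMMA interface (nvcuda::wmma in mma.h), which exposes the Tensor Cores only through a few discrete fragment shapes (e.g., 16x16x16 for FP16). The sketch below is not the paper's kernel; it only illustrates how one warp can multiply a single 16x16 FP16 problem from a batch with that interface. The kernel name and the device pointer arrays d_A, d_B, and d_C are hypothetical, and the code assumes 16x16 row-major A and C, column-major B, and compilation for sm_70 or newer.

    #include <cuda_fp16.h>
    #include <mma.h>

    using namespace nvcuda;

    // One thread block (a single warp of 32 threads) handles one problem in the batch.
    __global__ void batched_hgemm_16(const half* const* d_A,
                                     const half* const* d_B,
                                     half* const*       d_C)
    {
        int batch_id = blockIdx.x;

        // WMMA exposes Tensor Cores only through discrete fragment shapes;
        // 16x16x16 is used here, so smaller matrices must be padded to 16x16.
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, half>               c_frag;

        wmma::fill_fragment(c_frag, __float2half(0.0f));

        // Load the 16x16 operands (leading dimension 16), multiply-accumulate
        // on the Tensor Cores, and write the result tile back.
        wmma::load_matrix_sync(a_frag, d_A[batch_id], 16);
        wmma::load_matrix_sync(b_frag, d_B[batch_id], 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(d_C[batch_id], c_frag, 16, wmma::mem_row_major);
    }

A launch such as batched_hgemm_16<<<batch_count, 32>>>(dA, dB, dC) would run one single-warp block per problem. The paper's actual design goes further, packing multiple small problems per thread block and handling sizes that do not fill a Tensor Core fragment, which this sketch does not attempt.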
