Published in: International Conference on High Performance Computing

Performance, Design, and Autotuning of Batched GEMM for GPUs


Abstract

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
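To make the notion of "batched GEMM" concrete (this is a minimal illustration of the operation the abstract describes, not the paper's GPU kernel), the sketch below computes a batch of many small, independent matrix products in one call. On GPUs the same pattern is exposed by interfaces such as cuBLAS's batched GEMM routines; here NumPy's broadcasting over the leading batch dimension serves as a CPU analogue:

```python
import numpy as np

# Batched GEMM: many small, independent products C_i = A_i @ B_i.
# Sizes here are illustrative (1000 independent 8x8 products).
batch, m, k, n = 1000, 8, 8, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# np.matmul broadcasts over the leading batch dimension, performing
# all products in a single call -- the CPU analogue of launching one
# batched-GEMM kernel instead of `batch` separate GEMMs.
C = np.matmul(A, B)

# Equivalent (but slower) loop over individual GEMMs:
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_loop)
```

The motivation in the abstract is exactly this gap: launching one kernel per tiny matrix wastes the GPU, so a well-tuned kernel processes the whole batch at once, with separate design and autotuning for the fixed-size and variable-size cases.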

