Published in: International Conference on High Performance Computing

Performance, Design, and Autotuning of Batched GEMM for GPUs


Abstract

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in dense linear algebra, and is the key component for obtaining high performance in most LAPACK routines. As batched computations on relatively small problems continue to gain interest in many scientific applications, a need arises for a high performance GEMM kernel for batches of small matrices. Such a kernel should be well designed and tuned to handle small sizes, and to maintain high performance for realistic test cases found in the higher level LAPACK routines, and scientific computing applications in general. This paper presents a high performance batched GEMM kernel on Graphics Processing Units (GPUs). We address batched problems with both fixed and variable sizes, and show that specialized GEMM designs and a comprehensive autotuning process are needed to handle problems of small sizes. For most performance tests reported in this paper, the proposed kernels outperform state-of-the-art approaches using a K40c GPU.
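To make the notion of "batched GEMM" concrete (this is a minimal illustration of the operation the abstract describes, not the paper's GPU kernel), the sketch below computes a batch of many small, independent matrix products in one call. On GPUs the same pattern is exposed by interfaces such as cuBLAS's batched GEMM routines; here NumPy's broadcasting over the leading batch dimension serves as a CPU analogue:

```python
import numpy as np

# Batched GEMM: many small, independent products C_i = A_i @ B_i.
# Sizes here are illustrative (1000 independent 8x8 products).
batch, m, k, n = 1000, 8, 8, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((batch, m, k))
B = rng.standard_normal((batch, k, n))

# np.matmul broadcasts over the leading batch dimension, performing
# all products in a single call -- the CPU analogue of launching one
# batched-GEMM kernel instead of `batch` separate GEMMs.
C = np.matmul(A, B)

# Equivalent (but slower) loop over individual GEMMs:
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])
assert np.allclose(C, C_loop)
```

The motivation in the abstract is exactly this gap: launching one kernel per tiny matrix wastes the GPU, so a well-tuned kernel processes the whole batch at once, with separate design and autotuning for the fixed-size and variable-size cases.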

