Journal: Parallel Computing

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices



Abstract

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. (C) 2018 Elsevier B.V. All rights reserved.
