Journal: Parallel Computing

Algorithms and optimization techniques for high-performance matrix-matrix multiplications of very small matrices



Abstract

Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen. (C) 2018 Elsevier B.V. All rights reserved.
