Journal of Parallel and Distributed Computing

TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs



Abstract

Linear algebra operations are widely used in big data analytics and scientific computing. Much work has gone into optimizing linear algebra operations on GPUs for regular-shaped inputs, but little of it focuses on fully utilizing GPU resources when the input is not regular-shaped. Existing optimizations do not fully exploit the available memory bandwidth and computing power, so they achieve only sub-optimal performance. In this paper, we propose two efficient algorithms, TSM2R and TSM2L, for two classes of tall-and-skinny matrix-matrix multiplication on GPUs. Both target operations in which at least one of the input matrices is tall-and-skinny. Specifically, TSM2R is designed for multiplying a large regular-shaped matrix by a tall-and-skinny matrix, while TSM2L is designed for multiplying a tall-and-skinny matrix by a small regular-shaped matrix. We implement the proposed algorithms and test them on several modern NVIDIA GPU micro-architectures. Experiments show that, compared to the current state-of-the-art works, (1) TSM2R speeds up the computation by 1.6x on average and improves memory bandwidth utilization and computing power utilization by 18.1% and 20.5% on average, respectively, when the regular-shaped matrix is relatively large or medium-sized; and (2) TSM2L speeds up the computation by 1.9x on average and improves memory bandwidth utilization by up to 9.3% on average when the regular-shaped matrix is relatively small.
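To make the two input shapes in the abstract concrete, the following sketch estimates how many floating-point operations each multiplication performs per byte of memory traffic. It is an illustration only and not the TSM2R or TSM2L algorithm: the function name `gemm_cost`, the sizes `N` and `S`, the choice of 8-byte (double-precision) elements, and the idealized assumption that each matrix crosses the memory bus exactly once are all assumptions made here, not details taken from the paper.

```python
def gemm_cost(m, k, n, bytes_per_elem=8):
    """Estimate the cost of C[m x n] = A[m x k] @ B[k x n].

    Returns (flops, bytes_moved, flops_per_byte). The traffic estimate
    assumes A and B are each read once and C is written once, which is
    an idealized lower bound rather than a measurement.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved, flops / bytes_moved


# Hypothetical sizes chosen for illustration only.
N = 20480  # dimension of the large matrix
S = 8      # "skinny" dimension

# Regular-shaped case: (N x N) @ (N x N).
_, _, regular_intensity = gemm_cost(N, N, N)

# Shape targeted by TSM2R: large regular-shaped (N x N) times tall-and-skinny (N x S).
_, _, tsm2r_intensity = gemm_cost(N, N, S)

# Shape targeted by TSM2L: tall-and-skinny (N x S) times small regular-shaped (S x S).
_, _, tsm2l_intensity = gemm_cost(N, S, S)

for label, value in [
    ("regular", regular_intensity),
    ("TSM2R shape", tsm2r_intensity),
    ("TSM2L shape", tsm2l_intensity),
]:
    print(f"{label}: {value:.2f} FLOPs per byte")
```

Under this model the regular-shaped product performs on the order of N/12 FLOPs per byte, while the two tall-and-skinny shapes perform only about S/4 and S/8 FLOPs per byte. With so little arithmetic per byte loaded, such multiplications tend to be limited by memory bandwidth, which is consistent with the abstract reporting bandwidth utilization alongside speedup.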
