Efficient Triangular Matrix Vector Multiplication on the GPU

Abstract

The main purpose of this paper is to present a highly efficient GPU implementation of the trmv, the product of a triangular matrix and a vector. Developers usually compute the trmv with cuBLAS, a linear algebra library optimized for each generation of GPUs. To attain better performance than cuBLAS, our GPU implementation of the trmv uses several acceleration techniques for the latest GPUs. More specifically, it has the following features: (1) only one kernel is called; (2) the maximum number of threads is invoked; (3) all accesses to global memory are coalesced; (4) all accesses to shared memory are free of bank conflicts; and (5) shared memory accesses are minimized by a warp shuffle function. Experimental results on five generations of NVIDIA GPUs, for fp32 matrices of sizes from 32 × 32 to 16K × 16K, show that our GPU implementation is faster than cuBLAS and muBLAS for almost all matrix sizes and GPU generations.
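The abstract gives no code, but a minimal CUDA sketch may help illustrate features (3) and (5). It assumes a lower-triangular fp32 matrix L stored row-major and assigns one warp per row, so adjacent lanes read adjacent addresses (coalesced loads) and the per-lane partial sums are combined with warp shuffles instead of shared memory. The kernel name trmv_lower and the launch configuration are illustrative assumptions, not the paper's actual kernel.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Illustrative sketch, not the paper's kernel: y = L * x for a lower-triangular
// n x n fp32 matrix L stored row-major. One warp computes one row of y.
__global__ void trmv_lower(const float* L, const float* x, float* y, int n) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    int row  = gtid / 32;          // one warp (32 threads) per row
    int lane = gtid % 32;
    if (row >= n) return;          // all 32 lanes of a warp exit together

    // Lower-triangular: only columns 0..row are nonzero. Adjacent lanes read
    // adjacent addresses of the same row, so global loads are coalesced.
    float sum = 0.0f;
    for (int col = lane; col <= row; col += 32)
        sum += L[(size_t)row * n + col] * x[col];

    // Combine the 32 partial sums with warp shuffles; shared memory is never
    // touched, so there are no bank conflicts to avoid in the first place.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0) y[row] = sum;
}

int main() {
    const int n = 256;
    std::vector<float> hL(n * n, 0.0f), hx(n, 1.0f), hy(n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)
            hL[i * n + j] = 1.0f;          // all-ones lower triangle: y[i] = i + 1

    float *dL, *dx, *dy;
    cudaMalloc(&dL, n * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dL, hL.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;                         // 8 warps per block
    const int blocks  = (n * 32 + threads - 1) / threads;
    trmv_lower<<<blocks, threads>>>(dL, dx, dy, n);
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("y[0] = %g, y[%d] = %g (expected 1 and %d)\n", hy[0], n - 1, hy[n - 1], n);
    cudaFree(dL); cudaFree(dx); cudaFree(dy);
    return 0;
}

With x set to all ones and L to an all-ones lower triangle, y[i] should equal i + 1, which gives a quick correctness check for the sketch.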