Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs



Abstract

This paper proposes a fast implementation method for the general matrix-vector multiplication (GEMV) routine, one of the level-2 Basic Linear Algebra Subprograms (BLAS), for a column-major, non-transposed matrix on NVIDIA Kepler-architecture graphics processing units (GPUs). We began by implementing the GEMV kernel using typical blocking techniques for shared memory and registers, along with 128-bit vector load/store instructions. In our initial investigation, we found that although this kernel could approach the GPU's actual peak throughput at some matrix sizes, its performance fluctuated periodically with the problem size. We then investigated the cause of these fluctuations using a performance model based on the thread-block scheduling mechanism, and devised a method for determining optimal thread-block sizes that avoids them. As the results show, when run on two Kepler-architecture GPUs, our single-precision GEMV (SGEMV) routine achieved better throughput and better performance stability with respect to the problem size than existing implementations: CUBLAS 6.5, MAGMA 1.4.1, and KBLAS 1.0. Our implementation techniques can be applied not only to SGEMV but also to the double-precision (DGEMV), single-complex (CGEMV), and double-complex (ZGEMV) routines. While this paper focuses primarily on the Kepler architecture, we also explore the performance of the proposed implementation on the Maxwell architecture, Kepler's successor.
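The abstract describes a kernel built from shared-memory and register blocking combined with 128-bit vector load/store instructions. The CUDA sketch below illustrates what such a column-major, non-transposed SGEMV kernel can look like; the tile sizes, parameter names, and alignment assumptions (A 16-byte aligned, lda and m multiples of 4) are illustrative choices for this sketch, not the paper's tuned implementation.

```cuda
// Minimal SGEMV sketch (y = alpha*A*x + beta*y) for a column-major,
// non-transposed matrix A of size m x n with leading dimension lda.
// Assumptions (not from the paper): A is 16-byte aligned, lda is a
// multiple of 4, and m is a multiple of ROWS_PER_THREAD; a real kernel
// would add a scalar clean-up path for leftover rows.

#include <cuda_runtime.h>

#define BLOCK_THREADS   128  // threads per block (assumed)
#define TILE_COLS       128  // columns of x staged in shared memory per step
#define ROWS_PER_THREAD 4    // register blocking: 4 rows per thread (float4)

__global__ void sgemv_colmajor_kernel(int m, int n, float alpha,
                                      const float* __restrict__ A, int lda,
                                      const float* __restrict__ x,
                                      float beta, float* __restrict__ y)
{
    __shared__ float xs[TILE_COLS];

    // First of the ROWS_PER_THREAD consecutive rows owned by this thread.
    int row = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ROWS_PER_THREAD;

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;

    for (int j0 = 0; j0 < n; j0 += TILE_COLS) {
        // Cooperatively stage a tile of x into shared memory.
        for (int t = threadIdx.x; t < TILE_COLS; t += BLOCK_THREADS)
            xs[t] = (j0 + t < n) ? x[j0 + t] : 0.f;
        __syncthreads();

        if (row + ROWS_PER_THREAD <= m) {
            int jmax = min(TILE_COLS, n - j0);
            for (int jj = 0; jj < jmax; ++jj) {
                // 128-bit vector load: 4 consecutive rows of column j0+jj.
                const float4 a = *reinterpret_cast<const float4*>(
                    &A[row + (size_t)(j0 + jj) * lda]);
                float xv = xs[jj];
                acc0 += a.x * xv;
                acc1 += a.y * xv;
                acc2 += a.z * xv;
                acc3 += a.w * xv;
            }
        }
        __syncthreads();
    }

    if (row + ROWS_PER_THREAD <= m) {
        // For beta == 0 a real implementation would skip reading y.
        y[row + 0] = alpha * acc0 + beta * y[row + 0];
        y[row + 1] = alpha * acc1 + beta * y[row + 1];
        y[row + 2] = alpha * acc2 + beta * y[row + 2];
        y[row + 3] = alpha * acc3 + beta * y[row + 3];
    }
}
```

A matching launch would use ceil(m / (BLOCK_THREADS * ROWS_PER_THREAD)) blocks of BLOCK_THREADS threads each; the paper's contribution lies in choosing such thread-block sizes, guided by a thread-block scheduling model, so that the resulting number of scheduling waves per SM does not produce the periodic performance dips observed for naive block-size choices.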
