Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs



Abstract

This paper proposes a fast implementation method for the general matrix-vector multiplication (GEMV) routine, one of the level-2 Basic Linear Algebra Subprograms (BLAS), for a column-major, non-transposed matrix on NVIDIA Kepler-architecture graphics processing units (GPUs). We began by implementing the GEMV kernel using typical blocking techniques for shared memory and registers, along with 128-bit vector load/store instructions. In our initial investigation, we found that although this kernel could approach the GPU's actual peak throughput at some matrix sizes, its performance fluctuated periodically with the problem size. We then investigated the cause of these fluctuations using a performance model based on the thread-block scheduling mechanism, and devised a method for determining optimal thread-block sizes that avoids them. As the results show, when run on two Kepler-architecture GPUs, our single-precision GEMV (SGEMV) routine achieved better throughput and better performance stability with respect to the problem size than existing implementations: CUBLAS 6.5, MAGMA 1.4.1, and KBLAS 1.0. Our implementation techniques can be applied not only to SGEMV but also to the double-precision (DGEMV), single-complex (CGEMV), and double-complex (ZGEMV) routines. While this paper focuses primarily on the Kepler architecture, we also explore the performance of the proposed implementation on the Maxwell architecture, Kepler's successor.
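The abstract describes a kernel built from shared-memory and register blocking combined with 128-bit vector load/store instructions. The CUDA sketch below illustrates what such a column-major, non-transposed SGEMV kernel can look like; the tile sizes, parameter names, and alignment assumptions (A 16-byte aligned, lda and m multiples of 4) are illustrative choices for this sketch, not the paper's tuned implementation.

```cuda
// Minimal SGEMV sketch (y = alpha*A*x + beta*y) for a column-major,
// non-transposed matrix A of size m x n with leading dimension lda.
// Assumptions (not from the paper): A is 16-byte aligned, lda is a
// multiple of 4, and m is a multiple of ROWS_PER_THREAD; a real kernel
// would add a scalar clean-up path for leftover rows.

#include <cuda_runtime.h>

#define BLOCK_THREADS   128  // threads per block (assumed)
#define TILE_COLS       128  // columns of x staged in shared memory per step
#define ROWS_PER_THREAD 4    // register blocking: 4 rows per thread (float4)

__global__ void sgemv_colmajor_kernel(int m, int n, float alpha,
                                      const float* __restrict__ A, int lda,
                                      const float* __restrict__ x,
                                      float beta, float* __restrict__ y)
{
    __shared__ float xs[TILE_COLS];

    // First of the ROWS_PER_THREAD consecutive rows owned by this thread.
    int row = (blockIdx.x * BLOCK_THREADS + threadIdx.x) * ROWS_PER_THREAD;

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;

    for (int j0 = 0; j0 < n; j0 += TILE_COLS) {
        // Cooperatively stage a tile of x into shared memory.
        for (int t = threadIdx.x; t < TILE_COLS; t += BLOCK_THREADS)
            xs[t] = (j0 + t < n) ? x[j0 + t] : 0.f;
        __syncthreads();

        if (row + ROWS_PER_THREAD <= m) {
            int jmax = min(TILE_COLS, n - j0);
            for (int jj = 0; jj < jmax; ++jj) {
                // 128-bit vector load: 4 consecutive rows of column j0+jj.
                const float4 a = *reinterpret_cast<const float4*>(
                    &A[row + (size_t)(j0 + jj) * lda]);
                float xv = xs[jj];
                acc0 += a.x * xv;
                acc1 += a.y * xv;
                acc2 += a.z * xv;
                acc3 += a.w * xv;
            }
        }
        __syncthreads();
    }

    if (row + ROWS_PER_THREAD <= m) {
        // For beta == 0 a real implementation would skip reading y.
        y[row + 0] = alpha * acc0 + beta * y[row + 0];
        y[row + 1] = alpha * acc1 + beta * y[row + 1];
        y[row + 2] = alpha * acc2 + beta * y[row + 2];
        y[row + 3] = alpha * acc3 + beta * y[row + 3];
    }
}
```

A matching launch would use ceil(m / (BLOCK_THREADS * ROWS_PER_THREAD)) blocks of BLOCK_THREADS threads each; the paper's contribution lies in choosing such thread-block sizes, guided by a thread-block scheduling model, so that the resulting number of scheduling waves per SM does not produce the periodic performance dips observed for naive block-size choices.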
