Conference: IEEE International Conference on High Performance Computing and Communication

An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units



Abstract

Sparse matrix-vector multiplication (SpMV) is often a performance bottleneck in iterative solvers. Recently, Graphics Processing Units (GPUs) have been deployed to enhance the performance of this operation. We present a blocked version of the Transposed Jagged Diagonal storage format, tailored for GPUs, which we call BTJAD. We develop a highly optimized SpMV kernel that exploits the properties of the BTJAD storage format and reuses loaded values of the source vector in the registers of a GPU. Using 62 matrices with different sparsity patterns, executing on an NVIDIA Tesla T10 GPU, we compare the performance of our kernel with that of the SpMV kernels in NVIDIA's library. Our kernel achieves superior execution throughput for matrices with non-uniform nonzero row lengths, outperforming the best available kernels by up to 4.67x. When executing on the Fermi-class GeForce GTX480 GPU, which has a larger register file, the maximum speedup achieved by our kernel improves to 6.6x.
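The abstract does not spell out the BTJAD layout itself. For background, the sketch below shows SpMV over classic Jagged Diagonal Storage (JDS), the family of formats that TJAD and BTJAD build on: rows are permuted by decreasing nonzero count and nonzeros are stored one "jagged diagonal" at a time, which is what makes the traversal regular enough for GPU-style processing. This is an illustrative NumPy sketch, not the authors' kernel; all function and variable names are assumptions.

```python
import numpy as np

def to_jds(A):
    """Convert a dense matrix A to Jagged Diagonal Storage (JDS).

    Illustrative only: the paper's BTJAD format is a blocked, transposed
    variant of this idea, optimized for GPU register reuse.
    """
    nnz = (A != 0).sum(axis=1)
    perm = np.argsort(-nnz, kind="stable")       # rows by decreasing nnz count
    cols = [np.flatnonzero(A[r]) for r in perm]  # nonzero columns per permuted row
    values, col_idx, jd_ptr = [], [], [0]
    for d in range(int(nnz.max())):              # emit one jagged diagonal at a time
        for i, c in enumerate(cols):
            if d < len(c):
                values.append(A[perm[i], c[d]])
                col_idx.append(c[d])
        jd_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(jd_ptr), perm

def jds_spmv(values, col_idx, jd_ptr, perm, x, n_rows):
    """Compute y = A @ x from the JDS arrays produced by to_jds."""
    y = np.zeros(n_rows)
    for d in range(len(jd_ptr) - 1):
        start, end = jd_ptr[d], jd_ptr[d + 1]
        # Because rows are sorted by length, diagonal d touches exactly the
        # first (end - start) permuted rows; the inner loop is fully regular.
        for i in range(end - start):
            y[perm[i]] += values[start + i] * x[col_idx[start + i]]
    return y
```

The regular inner loop over each jagged diagonal is what maps well onto GPU threads; the blocked variant in the paper additionally groups diagonals so that loaded elements of the source vector x can be reused from registers.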

