
An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units

Abstract

Sparse matrix-vector multiplication (SpMV) is often a performance bottleneck in iterative solvers. Recently, Graphics Processing Units (GPUs) have been deployed to enhance the performance of this operation. We present a Blocked Transposed Jagged Diagonal (BTJAD) storage format, a blocked version of the Transposed Jagged Diagonal format tailored for GPUs. We develop a highly optimized SpMV kernel that exploits the properties of the BTJAD storage format and reuses loaded values of the source vector in the registers of a GPU. Using 62 matrices with different sparsity patterns and executing on an NVIDIA Tesla T10 GPU, we compare the performance of our kernel with that of the SpMV kernels in NVIDIA's library. Our kernel achieves superior execution throughput for matrices whose nonzero row lengths are non-uniform, outperforming the best available kernels by up to 4.67x. When executing on the Fermi-class GeForce GTX480 GPU, which has a larger register file, the maximum speedup achieved by our kernel improves to 6.6x.
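To make the storage idea behind the abstract concrete, the following is a minimal CPU sketch of SpMV using a jagged-diagonal-style layout: rows are permuted by decreasing nonzero count, and the matrix is stored as "jagged diagonals" (the d-th nonzero of every row that has one), which is the access pattern a GPU kernel would traverse. The function names and data layout here are hypothetical illustrations, not the paper's BTJAD format or kernel.

```python
def to_jad(rows, n):
    """Convert a row-wise sparse matrix to a jagged-diagonal-style layout.

    rows: list of n rows, each a list of (col, val) nonzeros.
    Returns (perm, jdiags), where perm[i] is the original index of the
    i-th row after sorting by decreasing nonzero count, and jdiags[d]
    holds the d-th nonzero of every row long enough to have one.
    """
    perm = sorted(range(n), key=lambda r: -len(rows[r]))
    maxlen = max((len(r) for r in rows), default=0)
    jdiags = []
    for d in range(maxlen):
        diag = [(i, rows[perm[i]][d])          # (sorted-row index, (col, val))
                for i in range(n) if d < len(rows[perm[i]])]
        jdiags.append(diag)
    return perm, jdiags


def spmv_jad(perm, jdiags, x, n):
    """y = A @ x, traversing one jagged diagonal at a time."""
    y = [0.0] * n
    for diag in jdiags:
        for i, (col, val) in diag:
            y[perm[i]] += val * x[col]         # scatter back via the permutation
    return y
```

Traversing diagonal by diagonal gives consecutive threads (here, loop iterations) contiguous work even when row lengths differ, which is why jagged-diagonal variants suit matrices with non-uniform nonzero row lengths.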

