Conference: International Conference on Parallel Processing

Optimizing SpMV for Diagonal Sparse Matrices on GPU

Abstract

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in scientific applications, and its performance depends heavily on how the nonzeros of the sparse matrix are distributed. In this paper, we propose a new storage format for diagonal sparse matrices, Compressed Row Segment with Diagonal-pattern (CRSD), in which diagonal patterns represent the diagonal distribution of the nonzeros. Because Graphics Processing Units (GPUs) offer tremendous computational power and OpenCL makes them better suited to scientific computing, we implement SpMV for the CRSD format on GPUs using OpenCL. Since OpenCL kernels are compiled at runtime, we design a code generator that produces codelets for all diagonal patterns once a matrix has been stored in CRSD format. The generated codelets already contain the index information of the nonzeros, which reduces memory pressure during the SpMV operation. The code generator also exploits properties of the GPU memory architecture and thread scheduling to improve performance. In the evaluation, we compare against four storage formats from prior state-of-the-art GPU implementations (Bell and Garland, 2009). Experimental results show speedups of up to 1.52x in double precision and 1.94x in single precision over the best-performing implementation among the four formats. We also evaluate on a two-socket quad-core Intel Xeon system, where the speedups over the CSR format with 8 threads reach up to 11.93x in double precision and 12.79x in single precision.
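The idea that a generated codelet can carry the index information itself can be illustrated with a small sketch. This is not the CRSD layout or the authors' OpenCL generator, neither of which the abstract specifies. It assumes plain diagonal (DIA-style) storage, where `diagonals[k][i]` holds `A[i, i + offsets[k]]`, and it uses Python source generation in place of runtime OpenCL compilation. The names `dia_spmv`, `generate_codelet`, and `compile_codelet` are hypothetical.

```python
from typing import Callable, List, Sequence


def dia_spmv(offsets: Sequence[int],
             diagonals: Sequence[Sequence[float]],
             x: Sequence[float]) -> List[float]:
    """Generic kernel: y = A @ x, reading the diagonal offsets at run time.

    Assumed layout: diagonals[k][i] == A[i, i + offsets[k]], zero-padded
    where the column index would fall outside the matrix.
    """
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, diagonals):
        # Only rows whose column i + off lies inside [0, n) contribute.
        for i in range(max(0, -off), min(n, n - off)):
            y[i] += diag[i] * x[i + off]
    return y


def generate_codelet(offsets: Sequence[int]) -> str:
    """Emit source for a kernel specialised to one set of diagonal offsets.

    The offsets appear as literals in the generated text, so the kernel
    never loads an index array while it runs.
    """
    lines = [
        "def codelet(diagonals, x):",
        "    n = len(x)",
        "    y = [0.0] * n",
    ]
    for k, off in enumerate(offsets):
        lo = max(0, -off)
        hi = f"n - {off}" if off > 0 else "n"
        lines.append(f"    for i in range({lo}, {hi}):")
        lines.append(f"        y[i] += diagonals[{k}][i] * x[i + {off}]")
    lines.append("    return y")
    return "\n".join(lines)


def compile_codelet(offsets: Sequence[int]) -> Callable:
    """Compile the generated source at run time and return the kernel."""
    namespace: dict = {}
    exec(generate_codelet(offsets), namespace)
    return namespace["codelet"]


if __name__ == "__main__":
    # 4x4 tridiagonal matrix: sub-, main and super-diagonal.
    offsets = [-1, 0, 1]
    diagonals = [
        [0.0, 3.0, 3.0, 3.0],  # A[i, i-1]
        [2.0, 2.0, 2.0, 2.0],  # A[i, i]
        [1.0, 1.0, 1.0, 0.0],  # A[i, i+1]
    ]
    x = [1.0, 2.0, 3.0, 4.0]
    print(generate_codelet(offsets))
    print(compile_codelet(offsets)(diagonals, x))
```

Both kernels compute the same product. The specialised one trades generality for fewer memory reads, which is the effect the abstract attributes to its generated codelets.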