2011 International Conference on Parallel Processing

Optimizing SpMV for Diagonal Sparse Matrices on GPU


Abstract

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in scientific applications, and its performance depends heavily on the distribution of nonzeros in the sparse matrix. In this paper, we propose a new storage format for diagonal sparse matrices, called Compressed Row Segment with Diagonal-pattern (CRSD). In CRSD, we design diagonal patterns to represent the diagonal distribution. Because Graphics Processing Units (GPUs) offer tremendous computational power and OpenCL makes them more suitable for scientific computing, we implement SpMV for the CRSD format on GPUs using OpenCL. Since OpenCL kernels are compiled at runtime, we design a code generator that produces codelets for all diagonal patterns after the matrices have been stored in CRSD format. Specifically, the generated codelets already contain the index information of the nonzeros, which reduces memory pressure during the SpMV operation. Furthermore, the code generator exploits properties of the GPU memory architecture and thread scheduling to improve performance. In the evaluation, we select four storage formats from prior state-of-the-art GPU implementations (Bell and Garland, 2009). Experimental results show speedups of up to 1.52 and 1.94 over the best implementation among the four formats for double and single precision, respectively. We also evaluate on a two-socket quad-core Intel Xeon system. The speedups reach up to 11.93 and 12.79 relative to the CSR format running with 8 threads, for double and single precision, respectively.
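The idea of generating codelets that already contain the index information can be illustrated with a small sketch. The example below is not the CRSD format or the authors' OpenCL generator; it assumes a plain diagonal (DIA) layout and uses Python source generation as a stand-in for runtime kernel compilation. The names `generate_dia_spmv` and `compile_spmv` are hypothetical.

```python
# Illustrative sketch only: plain DIA (diagonal) storage, not the paper's CRSD layout.

def generate_dia_spmv(offsets):
    """Return Python source for an SpMV routine specialised to fixed diagonal offsets.

    The offsets are written into the generated code as literals, so the
    routine reads no index array at run time.
    """
    lines = [
        "def spmv(diags, x):",
        "    n = len(x)",
        "    y = [0.0] * n",
        "    for i in range(n):",
        "        acc = 0.0",
    ]
    for d, k in enumerate(offsets):
        # Diagonal k holds A[i][i + k]; skip columns that fall outside the matrix.
        lines.append(f"        if 0 <= i + ({k}) < n:")
        lines.append(f"            acc += diags[{d}][i] * x[i + ({k})]")
    lines.append("        y[i] = acc")
    lines.append("    return y")
    return "\n".join(lines)


def compile_spmv(offsets):
    """Compile the generated source at run time and return the resulting function."""
    namespace = {}
    exec(generate_dia_spmv(offsets), namespace)
    return namespace["spmv"]


# 4x4 tridiagonal example: diags[d][i] stores A[i][i + offsets[d]] (unused slots are 0).
offsets = [-1, 0, 1]
diags = [
    [0.0, 1.0, 1.0, 1.0],  # sub-diagonal
    [2.0, 2.0, 2.0, 2.0],  # main diagonal
    [3.0, 3.0, 3.0, 0.0],  # super-diagonal
]
spmv = compile_spmv(offsets)
y = spmv(diags, [1.0, 2.0, 3.0, 4.0])
```

Because the offsets appear as constants in the generated source, the routine loads only matrix values and vector entries, which is the kind of memory-traffic reduction the abstract attributes to its codelets.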
