Journal: Parallel Computing

CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations

Abstract

Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation that performs SpMV on the GPU using atomic operations. We compare the performance of SCOO with that of the existing formats in the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that, on a single GPU, SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Because existing CUDA-enabled GPUs have lower performance for atomic operations on double-precision floating-point numbers, the double-precision SCOO implementation does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO on the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. In addition, a comparison with a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multithreaded CSR implementation of the Intel MKL library on an i7-2700K by a factor of between 5.5 (2.3) and 18 (12.7) for single (double) precision.
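
To make the idea of a COO-based SpMV with accumulated updates more concrete, here is a minimal sequential sketch in Python. It is not the authors' SCOO implementation: the abstract does not describe the storage layout, so the slicing rule (`row // slice_height`), the function name `spmv_coo_sliced`, and the per-slice accumulator are assumptions made purely for illustration. The comment marks the update that would likely need to be an atomic add once many GPU threads work on the same slice concurrently.

```python
# Illustrative sketch only: a sequential stand-in for a sliced, COO-style SpMV.
# The slice layout and every name below are assumptions, not taken from the paper.

def spmv_coo_sliced(n_rows, rows, cols, vals, x, slice_height):
    """Compute y = A @ x for A given as COO triples, one row slice at a time."""
    y = [0.0] * n_rows
    # Bucket the nonzeros by row slice (hypothetical rule: row // slice_height).
    n_slices = (n_rows + slice_height - 1) // slice_height
    slices = [[] for _ in range(n_slices)]
    for r, c, v in zip(rows, cols, vals):
        slices[r // slice_height].append((r, c, v))

    for s, entries in enumerate(slices):
        base = s * slice_height
        height = min(slice_height, n_rows - base)
        # Per-slice accumulator for the rows of y covered by this slice.
        local = [0.0] * height
        for r, c, v in entries:
            # With concurrent threads, several nonzeros may target the same
            # row, so this read-modify-write would have to be an atomic add.
            local[r - base] += v * x[c]
        y[base:base + height] = local
    return y


# 3x3 example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]], x = [1, 2, 3]
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_coo_sliced(3, rows, cols, vals, x, slice_height=2)
print(y)  # [7.0, 6.0, 19.0]
```

The result does not depend on `slice_height`; in this sketch the slice size only changes how the nonzeros are grouped, which is the kind of parameter a GPU implementation might tune against available on-chip memory.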