IEEE International Conference on Parallel and Distributed Systems

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format



Abstract

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures such as GPUs. Sparse storage data structures allow sparse matrices to be kept in a memory-saving format, but the irregular data accesses they induce make it difficult to optimize SpDM performance on modern GPUs, resulting in lower resource utilization and poorer performance. In this paper, we draw on the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are conducted on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
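For concreteness, the sketch below illustrates the basic SpDM operation the abstract targets, C = A x B, where A is sparse and B and C are dense. It is only a minimal CUDA illustration under assumed row-major layouts and plain COO storage: the kernel name spdm_coo and the thread mapping are hypothetical, and it does not reproduce the paper's customized GCOO format or its shared-memory reuse, only the coalesced column-wise global memory access pattern the abstract mentions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical baseline kernel: C (MxN, row-major) += A (sparse, COO) * B (KxN, row-major).
// Each thread processes one (nonzero, output-column) pair; consecutive threads
// touch consecutive columns, so reads of B and atomic updates of C are coalesced.
__global__ void spdm_coo(int nnz, int N,
                         const int* rows, const int* cols, const float* vals,
                         const float* B, float* C) {
    long long total = (long long)nnz * N;
    for (long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         t < total; t += (long long)gridDim.x * blockDim.x) {
        int k = (int)(t / N);   // which nonzero of A
        int j = (int)(t % N);   // which column of the output
        atomicAdd(&C[rows[k] * N + j], vals[k] * B[cols[k] * N + j]);
    }
}

int main() {
    // Toy example: A = [[0, 2], [3, 0]] in COO; B = all ones; expect C = [[2, 2], [3, 3]].
    const int M = 2, K = 2, N = 2, nnz = 2;
    int h_rows[] = {0, 1}, h_cols[] = {1, 0};
    float h_vals[] = {2.f, 3.f};
    float h_B[] = {1, 1, 1, 1}, h_C[4] = {0};

    int *d_rows, *d_cols; float *d_vals, *d_B, *d_C;
    cudaMalloc(&d_rows, nnz * sizeof(int));
    cudaMalloc(&d_cols, nnz * sizeof(int));
    cudaMalloc(&d_vals, nnz * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));
    cudaMemcpy(d_rows, h_rows, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, h_cols, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, K * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, M * N * sizeof(float), cudaMemcpyHostToDevice);

    spdm_coo<<<32, 256>>>(nnz, N, d_rows, d_cols, d_vals, d_B, d_C);
    cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C = [[%g, %g], [%g, %g]]\n", h_C[0], h_C[1], h_C[2], h_C[3]);
    return 0;
}
```

Mapping consecutive threads to consecutive output columns keeps the reads of B and the updates of C coalesced; per the abstract, GCOOSpDM goes further by replacing plain COO with a customized grouped storage format and reusing shared memory, so as to raise the number of operations per byte of global memory traffic.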

