IEEE International Conference on Parallel and Distributed Systems

Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format



Abstract

Multiplication of a sparse matrix by a dense matrix (SpDM) is widely used in many areas such as scientific computing and machine learning. However, existing work overlooks the performance optimization of SpDM on modern many-core architectures such as GPUs. Sparse storage data structures allow sparse matrices to be kept in a memory-saving format, but the irregular data accesses they induce make it difficult to optimize SpDM performance on modern GPUs, resulting in lower resource utilization and poorer performance. In this paper, we draw on the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, which exploits coalesced global memory access, fast shared memory reuse, and more operations per byte of global memory traffic. Experiments are conducted on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal, and Tesla P100) using a large number of matrices, including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves a 1.5-8x speedup over Nvidia's cuSPARSE library on many matrices.
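For concreteness, the sketch below illustrates the basic SpDM operation the abstract targets, C = A x B, where A is sparse and B and C are dense. It is only a minimal CUDA illustration under assumed row-major layouts and plain COO storage: the kernel name spdm_coo and the thread mapping are hypothetical, and it does not reproduce the paper's customized GCOO format or its shared-memory reuse, only the coalesced column-wise global memory access pattern the abstract mentions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical baseline kernel: C (MxN, row-major) += A (sparse, COO) * B (KxN, row-major).
// Each thread processes one (nonzero, output-column) pair; consecutive threads
// touch consecutive columns, so reads of B and atomic updates of C are coalesced.
__global__ void spdm_coo(int nnz, int N,
                         const int* rows, const int* cols, const float* vals,
                         const float* B, float* C) {
    long long total = (long long)nnz * N;
    for (long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         t < total; t += (long long)gridDim.x * blockDim.x) {
        int k = (int)(t / N);   // which nonzero of A
        int j = (int)(t % N);   // which column of the output
        atomicAdd(&C[rows[k] * N + j], vals[k] * B[cols[k] * N + j]);
    }
}

int main() {
    // Toy example: A = [[0, 2], [3, 0]] in COO; B = all ones; expect C = [[2, 2], [3, 3]].
    const int M = 2, K = 2, N = 2, nnz = 2;
    int h_rows[] = {0, 1}, h_cols[] = {1, 0};
    float h_vals[] = {2.f, 3.f};
    float h_B[] = {1, 1, 1, 1}, h_C[4] = {0};

    int *d_rows, *d_cols; float *d_vals, *d_B, *d_C;
    cudaMalloc(&d_rows, nnz * sizeof(int));
    cudaMalloc(&d_cols, nnz * sizeof(int));
    cudaMalloc(&d_vals, nnz * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));
    cudaMemcpy(d_rows, h_rows, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, h_cols, nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vals, h_vals, nnz * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, K * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, M * N * sizeof(float), cudaMemcpyHostToDevice);

    spdm_coo<<<32, 256>>>(nnz, N, d_rows, d_cols, d_vals, d_B, d_C);
    cudaMemcpy(h_C, d_C, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C = [[%g, %g], [%g, %g]]\n", h_C[0], h_C[1], h_C[2], h_C[3]);
    return 0;
}
```

Mapping consecutive threads to consecutive output columns keeps the reads of B and the updates of C coalesced; per the abstract, GCOOSpDM goes further by replacing plain COO with a customized grouped storage format and reusing shared memory, so as to raise the number of operations per byte of global memory traffic.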

