2012 19th International Conference on High Performance Computing

Sparse matrix-matrix multiplication on modern architectures



Abstract

Sparse matrix-sparse matrix and sparse matrix-dense matrix multiplications (spgemm and csrmm, respectively) find use, among other applications, in various matrix formulations of graph problems. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of the HPC community. These computations pose challenges such as load balancing, the irregular nature of the computation, and the difficulty of predicting the output size. They become even more challenging when combined with GPU architectural constraints such as memory accesses, limited shared memory, and strict SIMD thread execution. To address these challenges on a GPU, we evaluate three possible formulations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations targeted at sparse matrices. Our experiments indicate that the Row-Row formulation, which mostly outperforms the other formulations, is 3.5x faster on average than an optimized multi-core implementation in the Intel MKL library. We extend the Row-Row formulation to a hybrid algorithm that utilizes the CPU and the GPU simultaneously, and we present heuristics to find the right division of work between the two. Our hybrid Row-Row formulation of the spgemm operation performs 5.5x faster on average than the optimized multi-core implementation in the Intel MKL library. Our experience indicates that identifying the right division of work between the CPU and the GPU is difficult in general. We therefore investigate a subclass of sparse matrices, band matrices, and present an analytical method to identify a good work division when multiplying two band matrices. Our GPU csrmm operation performs 2.5x faster on average than the corresponding implementation in the cusparse library, which itself outperforms the Intel MKL implementation.
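For concreteness, the sketch below shows the Row-Row formulation of spgemm on CSR inputs as a minimal sequential C++ routine: row i of C is built as a combination of rows of B, one per nonzero A(i,k). This is a Gustavson-style sketch under assumed names (the Csr struct, the dense-accumulator strategy); the paper's GPU kernels, memory-access optimizations, and output-size prediction are not reproduced here.

```cpp
#include <vector>

// CSR storage: rowPtr has n+1 entries; col/val hold the nonzeros row by row.
struct Csr {
    int n;                       // number of rows (square n x n for simplicity)
    std::vector<int> rowPtr;
    std::vector<int> col;
    std::vector<double> val;
};

// Row-Row formulation: for each nonzero A(i,k), scale row k of B and add it
// into a dense accumulator that gathers the current output row.
Csr spgemm_row_row(const Csr& A, const Csr& B) {
    Csr C{A.n, std::vector<int>(A.n + 1, 0), {}, {}};
    std::vector<double> acc(B.n, 0.0);  // dense accumulator over columns of B
    std::vector<int> mark(B.n, -1);     // which output row last touched a column
    std::vector<int> touched;           // columns present in the current row
    for (int i = 0; i < A.n; ++i) {
        touched.clear();
        for (int t = A.rowPtr[i]; t < A.rowPtr[i + 1]; ++t) {
            int k = A.col[t];
            double a = A.val[t];
            for (int u = B.rowPtr[k]; u < B.rowPtr[k + 1]; ++u) {
                int j = B.col[u];
                if (mark[j] != i) { mark[j] = i; acc[j] = 0.0; touched.push_back(j); }
                acc[j] += a * B.val[u];
            }
        }
        for (int j : touched) {          // emit the row (columns unsorted)
            C.col.push_back(j);
            C.val.push_back(acc[j]);
        }
        C.rowPtr[i + 1] = (int)C.col.size();
    }
    return C;
}
```

The outer loop over rows of C is embarrassingly parallel, which is what makes this formulation attractive on many-core hardware; on a GPU the dense per-row accumulator would typically give way to per-warp structures in shared memory, a substitution this sketch does not attempt.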
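The abstract does not spell out the work-division heuristics, so the following is only a sketch of one natural approach under an assumed cost model: charge row i of C the sum of nnz(row k of B) over the nonzeros A(i,k), which is exactly the number of multiply-adds the Row-Row formulation performs for that row, then hand a contiguous prefix of rows to the GPU in proportion to an assumed relative throughput. The function name split_rows and the gpuFraction parameter are hypothetical.

```cpp
#include <vector>

// Returns a split point: rows [0, split) go to the GPU, rows [split, n)
// to the CPU. gpuFraction is the assumed share of total work the GPU
// should receive (e.g. 0.8), not a value taken from the paper.
int split_rows(const std::vector<int>& aRowPtr,   // CSR row pointers of A
               const std::vector<int>& aCol,      // CSR column indices of A
               const std::vector<int>& bRowPtr,   // CSR row pointers of B
               double gpuFraction)
{
    int n = (int)aRowPtr.size() - 1;
    std::vector<long long> cost(n);
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        long long c = 0;
        // Multiply-adds for row i of C under the Row-Row formulation.
        for (int t = aRowPtr[i]; t < aRowPtr[i + 1]; ++t)
            c += bRowPtr[aCol[t] + 1] - bRowPtr[aCol[t]];
        cost[i] = c;
        total += c;
    }
    long long target = (long long)(gpuFraction * (double)total);
    long long seen = 0;
    int split = 0;
    while (split < n && seen + cost[split] <= target)
        seen += cost[split++];
    return split;
}
```

A row split keeps the two devices working on disjoint output rows, so their partial results can be concatenated without a merge step.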

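For band matrices the per-row cost becomes predictable, which is what makes an analytical division possible. The abstract does not give the formula; a minimal cost model under the assumption of full bands goes as follows.

```latex
% Assumed model, not the paper's derivation: for full bands with
% half-bandwidths b_A and b_B, an interior row of A has 2 b_A + 1 nonzeros,
% each multiplying a row of B with 2 b_B + 1 nonzeros, so every interior
% output row costs the same number of multiply-adds:
W_{\text{row}} = (2 b_A + 1)(2 b_B + 1).
% With uniform per-row work, the split reduces to a row count proportional
% to the devices' relative throughputs T_{\text{gpu}} and T_{\text{cpu}}:
r_{\text{gpu}} = \left\lceil n \cdot \frac{T_{\text{gpu}}}{T_{\text{gpu}} + T_{\text{cpu}}} \right\rceil.
```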