SIAM Journal on Scientific Computing

MEMORY-EFFICIENT SPARSE MATRIX-MATRIX MULTIPLICATION BY ROW MERGING ON MANY-CORE ARCHITECTURES



Abstract

Sparse matrix-matrix multiplication (SpMM) is used for many computational tasks, including algebraic multigrid solvers and graph operations. Our aim is to provide a fast SpMM implementation (RMerge2) for many-core architectures such as graphics processing units (GPUs) that operates with predictable and low memory overhead. RMerge2 breaks the sparse matrix-matrix product into many sparse vector-matrix products and computes these by parallel row merging. It merges up to 512 rows using warps of up to 32 threads, where each thread maintains state for multiple rows in registers. Larger numbers of rows (up to 32K, where K = 1024) are merged by a whole thread block using shared memory. The rows of the left-hand side are grouped into cases based on their number of nonzeros and are processed with specific kernels implemented using C++ templates. This approach computes SpMM with a memory overhead of only 5 bytes per row for left-hand sides with up to 32K nonzeros per row. Latencies of small but long kernels are hidden by concurrent kernel execution based on streams. Performance measurements show that merging more than one row per thread improves performance for homogeneous matrices by factors of up to 2.4. Case-based and concurrent kernels improve performance, particularly for heterogeneous matrices, by factors of up to 1.6 and 3.7, respectively. Compared to a parallel CPU implementation, RMerge2 achieves a mean speedup of 4.5, and compared to three other GPU implementations, speedup factors of 11.3, 8.6, and 2.5 for matrix squaring and of 7.4, 1.9, and 2.4 for Galerkin products are achieved. The Pascal GPU is faster than the Kepler GPU by an average factor of 3.6 for matrix squaring, i.e., more than expected from the increase in nominal peak performance and memory bandwidth, suggesting that other improvements, including faster memory allocation, stream creation, and warp shuffle operations, contribute to the overall performance increase.
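The core idea of the abstract — computing each row of C = A·B as a sparse vector-matrix product, i.e., as a merge of the rows of B selected and scaled by the nonzeros of the corresponding row of A — can be sketched on the CPU. The following is a minimal Python illustration, not the paper's CUDA implementation; the CSR-style dictionary representation and the heap-based k-way merge are illustrative assumptions standing in for the warp- and shared-memory-based merging described above.

```python
import heapq

def scaled_row(B, k, a):
    """Yield (column, a * value) pairs for row k of CSR matrix B, in column order."""
    for q in range(B['indptr'][k], B['indptr'][k + 1]):
        yield B['indices'][q], a * B['data'][q]

def spmm_row_merge(A, B):
    """Compute C = A @ B for CSR-like dicts with 'indptr', 'indices', 'data'."""
    C_indptr, C_indices, C_data = [0], [], []
    for i in range(len(A['indptr']) - 1):
        # Row i of C is the merge of the rows of B selected by the nonzeros
        # of row i of A, each scaled by the corresponding value from A.
        streams = [scaled_row(B, A['indices'][p], A['data'][p])
                   for p in range(A['indptr'][i], A['indptr'][i + 1])]
        # k-way merge by column index; runs of equal columns are accumulated.
        prev_col, acc = None, 0.0
        for col, val in heapq.merge(*streams, key=lambda t: t[0]):
            if col == prev_col:
                acc += val
            else:
                if prev_col is not None:
                    C_indices.append(prev_col)
                    C_data.append(acc)
                prev_col, acc = col, val
        if prev_col is not None:
            C_indices.append(prev_col)
            C_data.append(acc)
        C_indptr.append(len(C_indices))
    return {'indptr': C_indptr, 'indices': C_indices, 'data': C_data}
```

On the GPU, RMerge2 performs this merge in parallel: a warp of up to 32 threads merges up to 512 rows with per-thread register state, and larger merges fall back to a whole thread block using shared memory.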
