SIAM Journal on Scientific Computing

MEMORY-EFFICIENT SPARSE MATRIX-MATRIX MULTIPLICATION BY ROW MERGING ON MANY-CORE ARCHITECTURES



Abstract

Sparse matrix-matrix multiplication (SpMM) is used for many computational tasks, including algebraic multigrid solvers and graph operations. Our aim is to provide a fast SpMM implementation (RMerge2) for many-core architectures such as graphics processing units (GPUs) that operates with predictable and low memory overhead. RMerge2 breaks the sparse matrix-matrix product into many sparse vector-matrix products and computes these by parallel row merging. It merges up to 512 rows using warps of up to 32 threads, where each thread maintains state for multiple rows in registers. Larger numbers of rows (up to 32K, where K = 1024) are merged by a whole thread block using shared memory. The rows of the left-hand side are grouped into cases based on their number of nonzeros and are processed with specific kernels implemented using C++ templates. This approach computes SpMM with a memory overhead of only 5 bytes per row for left-hand sides with up to 32K nonzeros per row. Latencies of small but long kernels are hidden by concurrent kernel execution based on streams. Performance measurements show that merging more than one row per thread improves performance for homogeneous matrices by factors of up to 2.4. Case-based and concurrent kernels improve performance, particularly for heterogeneous matrices, by factors of up to 1.6 and 3.7, respectively. Compared to a parallel CPU implementation, RMerge2 achieves a mean speedup of 4.5, and compared to three other GPU implementations, speedup factors of 11.3, 8.6, and 2.5 for matrix squaring and of 7.4, 1.9, and 2.4 for Galerkin products are achieved. The Pascal GPU is faster than the Kepler GPU by an average factor of 3.6 for matrix squaring, i.e., more than expected from the increase in nominal peak performance and memory bandwidth, suggesting that other improvements, including faster memory allocation, stream creation, and warp shuffle operations, contribute to the overall performance increase.
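The core idea of the abstract — computing each row of C = A·B as a sparse vector-matrix product, i.e., as a merge of the rows of B selected and scaled by the nonzeros of the corresponding row of A — can be sketched on the CPU. The following is a minimal Python illustration, not the paper's CUDA implementation; the CSR-style dictionary representation and the heap-based k-way merge are illustrative assumptions standing in for the warp- and shared-memory-based merging described above.

```python
import heapq

def scaled_row(B, k, a):
    """Yield (column, a * value) pairs for row k of CSR matrix B, in column order."""
    for q in range(B['indptr'][k], B['indptr'][k + 1]):
        yield B['indices'][q], a * B['data'][q]

def spmm_row_merge(A, B):
    """Compute C = A @ B for CSR-like dicts with 'indptr', 'indices', 'data'."""
    C_indptr, C_indices, C_data = [0], [], []
    for i in range(len(A['indptr']) - 1):
        # Row i of C is the merge of the rows of B selected by the nonzeros
        # of row i of A, each scaled by the corresponding value from A.
        streams = [scaled_row(B, A['indices'][p], A['data'][p])
                   for p in range(A['indptr'][i], A['indptr'][i + 1])]
        # k-way merge by column index; runs of equal columns are accumulated.
        prev_col, acc = None, 0.0
        for col, val in heapq.merge(*streams, key=lambda t: t[0]):
            if col == prev_col:
                acc += val
            else:
                if prev_col is not None:
                    C_indices.append(prev_col)
                    C_data.append(acc)
                prev_col, acc = col, val
        if prev_col is not None:
            C_indices.append(prev_col)
            C_data.append(acc)
        C_indptr.append(len(C_indices))
    return {'indptr': C_indptr, 'indices': C_indices, 'data': C_data}
```

On the GPU, RMerge2 performs this merge in parallel: a warp of up to 32 threads merges up to 512 rows with per-thread register state, and larger merges fall back to a whole thread block using shared memory.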
