2012 19th International Conference on High Performance Computing

Sparse matrix-matrix multiplication on modern architectures



Abstract

Sparse matrix-sparse matrix and sparse matrix-dense matrix multiplications (spgemm and csrmm, respectively) find use, among other applications, in various matrix formulations of graph problems. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of the HPC community. These computations pose challenges such as load balancing, the irregular nature of the computation, and the difficulty of predicting the output size. They become even more challenging when combined with GPU architectural constraints such as memory accesses, limited shared memory, and strict SIMD thread execution. To address these challenges on a GPU, we evaluate three possible formulations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations targeted at sparse matrices. Our experiments indicate that the Row-Row formulation, which mostly outperforms the other formulations, is 3.5x faster on average than an optimized multi-core implementation in the Intel MKL library. We extend the Row-Row formulation to a hybrid algorithm that utilizes the CPU and the GPU simultaneously, and we present heuristics to find the right division of work between the two. Our hybrid Row-Row formulation of the spgemm operation performs 5.5x faster on average than the optimized multi-core implementation in the Intel MKL library. Our experience indicates that identifying the right division of work between the CPU and the GPU is difficult in general. We therefore investigate a subclass of sparse matrices, band matrices, and present an analytical method to identify a good work division when multiplying two band matrices. Our GPU csrmm operation performs 2.5x faster on average than the corresponding implementation in the cusparse library, which itself outperforms the Intel MKL implementation.
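For concreteness, the sketch below shows the Row-Row formulation of spgemm on CSR inputs as a minimal sequential C++ routine: row i of C is built as a combination of rows of B, one per nonzero A(i,k). This is a Gustavson-style sketch under assumed names (the Csr struct, the dense-accumulator strategy); the paper's GPU kernels, memory-access optimizations, and output-size prediction are not reproduced here.

```cpp
#include <vector>

// CSR storage: rowPtr has n+1 entries; col/val hold the nonzeros row by row.
struct Csr {
    int n;                       // number of rows (square n x n for simplicity)
    std::vector<int> rowPtr;
    std::vector<int> col;
    std::vector<double> val;
};

// Row-Row formulation: for each nonzero A(i,k), scale row k of B and add it
// into a dense accumulator that gathers the current output row.
Csr spgemm_row_row(const Csr& A, const Csr& B) {
    Csr C{A.n, std::vector<int>(A.n + 1, 0), {}, {}};
    std::vector<double> acc(B.n, 0.0);  // dense accumulator over columns of B
    std::vector<int> mark(B.n, -1);     // which output row last touched a column
    std::vector<int> touched;           // columns present in the current row
    for (int i = 0; i < A.n; ++i) {
        touched.clear();
        for (int t = A.rowPtr[i]; t < A.rowPtr[i + 1]; ++t) {
            int k = A.col[t];
            double a = A.val[t];
            for (int u = B.rowPtr[k]; u < B.rowPtr[k + 1]; ++u) {
                int j = B.col[u];
                if (mark[j] != i) { mark[j] = i; acc[j] = 0.0; touched.push_back(j); }
                acc[j] += a * B.val[u];
            }
        }
        for (int j : touched) {          // emit the row (columns unsorted)
            C.col.push_back(j);
            C.val.push_back(acc[j]);
        }
        C.rowPtr[i + 1] = (int)C.col.size();
    }
    return C;
}
```

The outer loop over rows of C is embarrassingly parallel, which is what makes this formulation attractive on many-core hardware; on a GPU the dense per-row accumulator would typically give way to per-warp structures in shared memory, a substitution this sketch does not attempt.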
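The abstract does not spell out the work-division heuristics, so the following is only a sketch of one natural approach under an assumed cost model: charge row i of C the sum of nnz(row k of B) over the nonzeros A(i,k), which is exactly the number of multiply-adds the Row-Row formulation performs for that row, then hand a contiguous prefix of rows to the GPU in proportion to an assumed relative throughput. The function name split_rows and the gpuFraction parameter are hypothetical.

```cpp
#include <vector>

// Returns a split point: rows [0, split) go to the GPU, rows [split, n)
// to the CPU. gpuFraction is the assumed share of total work the GPU
// should receive (e.g. 0.8), not a value taken from the paper.
int split_rows(const std::vector<int>& aRowPtr,   // CSR row pointers of A
               const std::vector<int>& aCol,      // CSR column indices of A
               const std::vector<int>& bRowPtr,   // CSR row pointers of B
               double gpuFraction)
{
    int n = (int)aRowPtr.size() - 1;
    std::vector<long long> cost(n);
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        long long c = 0;
        // Multiply-adds for row i of C under the Row-Row formulation.
        for (int t = aRowPtr[i]; t < aRowPtr[i + 1]; ++t)
            c += bRowPtr[aCol[t] + 1] - bRowPtr[aCol[t]];
        cost[i] = c;
        total += c;
    }
    long long target = (long long)(gpuFraction * (double)total);
    long long seen = 0;
    int split = 0;
    while (split < n && seen + cost[split] <= target)
        seen += cost[split++];
    return split;
}
```

A row split keeps the two devices working on disjoint output rows, so their partial results can be concatenated without a merge step.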

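For band matrices the per-row cost becomes predictable, which is what makes an analytical division possible. The abstract does not give the formula; a minimal cost model under the assumption of full bands goes as follows.

```latex
% Assumed model, not the paper's derivation: for full bands with
% half-bandwidths b_A and b_B, an interior row of A has 2 b_A + 1 nonzeros,
% each multiplying a row of B with 2 b_B + 1 nonzeros, so every interior
% output row costs the same number of multiply-adds:
W_{\text{row}} = (2 b_A + 1)(2 b_B + 1).
% With uniform per-row work, the split reduces to a row count proportional
% to the devices' relative throughputs T_{\text{gpu}} and T_{\text{cpu}}:
r_{\text{gpu}} = \left\lceil n \cdot \frac{T_{\text{gpu}}}{T_{\text{gpu}} + T_{\text{cpu}}} \right\rceil.
```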