Conference: International Workshop on Accelerator Programming Using Directives

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Abstract

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8× to 4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluation compared a single Skylake CPU against a single Skylake CPU paired with one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (the inner product and SpMM kernels) and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
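To make the tiling idea concrete, the following is a minimal sketch of a row-tiled SpMM written with an OpenMP target directive. It is not the authors' implementation: the function name `spmm_csr_tiled`, the CSR storage format, the row-major layout of the dense blocks, the square-matrix assumption, and the `tile_rows` parameter are all assumptions made for illustration. The sketch also assumes that the dense block X fits in device memory and, for simplicity, re-maps X for every tile.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: Y = A * X, where A is an n x n sparse matrix in CSR
 * format and X, Y are row-major n x k dense blocks.
 * Rows are processed in tiles of tile_rows so that only one tile of A and Y
 * is mapped to the device at a time. X is mapped in full for each tile,
 * which assumes the dense block itself fits in device memory. */
void spmm_csr_tiled(int n, int k, int tile_rows,
                    const int *row_ptr, const int *col_idx,
                    const double *vals, const double *X, double *Y)
{
    assert(tile_rows > 0);
    for (int r0 = 0; r0 < n; r0 += tile_rows) {
        /* Row range [r0, r1) and nonzero range of the current tile. */
        int r1 = (r0 + tile_rows < n) ? r0 + tile_rows : n;
        int nz0 = row_ptr[r0];
        int nnz_tile = row_ptr[r1] - nz0;

        /* Map only the slices of A and Y that belong to this tile. */
        #pragma omp target teams distribute parallel for \
            map(to: row_ptr[r0:r1 - r0 + 1], col_idx[nz0:nnz_tile], \
                    vals[nz0:nnz_tile], X[0:(size_t)n * k]) \
            map(from: Y[(size_t)r0 * k:(size_t)(r1 - r0) * k])
        for (int i = r0; i < r1; ++i) {
            for (int j = 0; j < k; ++j) {
                double sum = 0.0;
                for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
                    sum += vals[p] * X[(size_t)col_idx[p] * k + j];
                Y[(size_t)i * k + j] = sum;
            }
        }
    }
}
```

When compiled without OpenMP offloading support, the directive is either ignored or falls back to host execution, so the function still computes the same result. A Unified Memory variant would instead run the untiled kernel over the whole matrix and rely on the driver to page data between host and device on demand, which is the approach the abstract reports as slower for SpMM.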