Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Abstract

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8x-4.3x speedup over an optimized CPU implementation when tested with four different input matrices. The evaluation compared one Skylake CPU against one Skylake CPU paired with one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (inner product and SpMM) and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9x and 48.2x speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
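
The tiling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' OpenMP or OpenACC code: it is plain Python, it assumes the sparse matrix is stored in CSR form and split into row tiles, and it assumes the dense block X fits in memory as a whole. The function names (`spmm_csr`, `tiled_spmm_csr`) and the `tile_rows` parameter are hypothetical and chosen only for illustration.

```python
def spmm_csr(row_ptr, col_idx, vals, X):
    """Reference Y = A @ X for a CSR matrix A and a dense block X (list of rows)."""
    ncols_x = len(X[0])
    n = len(row_ptr) - 1
    Y = [[0.0] * ncols_x for _ in range(n)]
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[k], col_idx[k]
            for c in range(ncols_x):
                Y[i][c] += a * X[j][c]
    return Y


def tiled_spmm_csr(row_ptr, col_idx, vals, X, tile_rows):
    """Same product, computed one row tile of A at a time."""
    n = len(row_ptr) - 1
    Y = []
    for r0 in range(0, n, tile_rows):
        r1 = min(r0 + tile_rows, n)
        lo, hi = row_ptr[r0], row_ptr[r1]
        # Slice out one tile of A. In a GPU code this would be the piece
        # copied to device memory before the kernel runs (assumption).
        tile_ptr = [p - lo for p in row_ptr[r0:r1 + 1]]
        tile_cols = col_idx[lo:hi]
        tile_vals = vals[lo:hi]
        # Each tile produces a contiguous block of output rows.
        Y.extend(spmm_csr(tile_ptr, tile_cols, tile_vals, X))
    return Y


# Small 4x4 CSR matrix and a 4x2 dense block, for illustration only.
row_ptr = [0, 2, 3, 5, 6]
col_idx = [0, 2, 1, 0, 3, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

Y_ref = spmm_csr(row_ptr, col_idx, vals, X)
Y_tiled = tiled_spmm_csr(row_ptr, col_idx, vals, X, tile_rows=3)
```

Because each row tile only writes its own output rows, the tiles are independent. That independence is what would let a real implementation stream tiles through limited device memory; how the paper actually sizes and schedules its tiles is not stated in the abstract.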

Bibliographic details

Source
International Workshop on Accelerator Programming Using Directives | 2019 | pp. 66-88 | 23 pages
Authors
Fazlay Rabbi; Christopher S. Daley; Hasan Metin Aktulga; Nicholas J. Wright
Original format
PDF
Keywords
Sparse solvers; Performance optimization; Performance portability; Directive-based programming models; OpenMP 4.5; OpenACC

Similar literature

1. Exploring the interoperability of remote GPGPU virtualization using rCUDA and directive-based programming models [J]. Castello Adrian, Pena Antonio J., Mayo Rafael. Journal of Supercomputing, 2018, No. 11.
2. Multi-GPU Support on Single Node Using Directive-Based Programming Model [J]. Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran. Scientific Programming, 2015, No. 4.
3. Multi-GPU Support on Single Node Using Directive-Based Programming Model [J]. Xu Rengan, Tian Xiaonan, Chandrasekaran Sunita. Scientific Programming, 2015.
4. Early evaluation of directive-based GPU programming models for productive exascale computing [C]. Lee Seyong, Vetter Jeffrey S. 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
5. Directive-based general-purpose GPU programming [D]. Han, Tian Yi David. 2009.
6. Integer Programs for One- and Two-Mode Blockmodeling Based on Prespecified Image Matrices for Structural and Regular Equivalence [O]. Michael J. Brusco, Douglas Steinley.
7. Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices [O]. Fazlay Rabbi, Christopher S. Daley, Hasan Metin Aktulga. 2020.
