Supercomputing Frontiers (conference proceedings)

Optimization of Hierarchical Matrix Computation on GPU



Abstract

The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a technique that reduces the memory requirements of dense matrix computations. However, computing with H-matrices is more complex than computing with dense or sparse matrices, so accelerating H-matrix computation is necessary. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times with OpenMP implementations on various processors (Broadwell-EP, Skylake-SP, and Knights Landing). The results show that, although HMVM decomposes into many small GEMV operations, merging them into a single GPU kernel was the most effective implementation. Moreover, the performance of BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
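
To make the structure the abstract refers to concrete, the sketch below shows why HMVM reduces to many small GEMV operations: each leaf block of an H-matrix is either a small dense block or a low-rank factorization A ≈ U·Vᵀ, and multiplying a vector by it costs one or two small matrix-vector products. This is a minimal CPU-side illustration under an assumed leaf-block layout, with hypothetical names (HLeaf, leaf_mv, hmvm), not the paper's implementation; the question the paper studies is how to schedule these many independent small GEMVs efficiently on the GPU.

```c
#include <stddef.h>

/* Hypothetical leaf-block representation of an H-matrix (illustration only).
 * A dense leaf stores the m x n block A explicitly; a low-rank leaf stores
 * the factors U (m x k) and V (n x k) of the approximation A ~= U * V^T. */
typedef struct {
    int is_lowrank;      /* 0: dense block, 1: low-rank block      */
    size_t row0, col0;   /* offset of the block in the full matrix */
    size_t m, n, k;      /* block size and (for low-rank) the rank */
    const double *A;     /* dense:    m x n, row-major             */
    const double *U;     /* low-rank: m x k, row-major             */
    const double *V;     /* low-rank: n x k, row-major             */
} HLeaf;

/* y_block += A_block * x_block for one leaf: a dense leaf is one small GEMV,
 * a low-rank leaf is two small GEMVs (t = V^T x, then y += U t). */
static void leaf_mv(const HLeaf *b, const double *x, double *y, double *t)
{
    const double *xb = x + b->col0;
    double *yb = y + b->row0;

    if (!b->is_lowrank) {
        for (size_t i = 0; i < b->m; i++)
            for (size_t j = 0; j < b->n; j++)
                yb[i] += b->A[i * b->n + j] * xb[j];
    } else {
        for (size_t r = 0; r < b->k; r++) {        /* t = V^T * x_block */
            t[r] = 0.0;
            for (size_t j = 0; j < b->n; j++)
                t[r] += b->V[j * b->k + r] * xb[j];
        }
        for (size_t i = 0; i < b->m; i++)          /* y_block += U * t  */
            for (size_t r = 0; r < b->k; r++)
                yb[i] += b->U[i * b->k + r] * t[r];
    }
}

/* HMVM: loop over all leaves; `work` must hold at least the maximum rank.
 * On a GPU, these independent small GEMVs are the natural unit to batch or,
 * as the paper finds most effective, to merge into a single kernel. */
void hmvm(const HLeaf *leaves, size_t nleaves,
          const double *x, double *y, double *work)
{
    for (size_t b = 0; b < nleaves; b++)
        leaf_mv(&leaves[b], x, y, work);
}
```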
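
The abstract also reports that MAGMA's BATCHED BLAS was competitive with the hand-tuned kernel. As an illustration of the batching idea only, the sketch below launches a batch of small matrix-vector products through cuBLAS's cublasDgemmBatched, treating each vector as an n x 1 matrix. The fixed block size here is an assumption made for brevity; real H-matrix leaves vary in size, which is where variable-size batched routines such as those in MAGMA come in.

```c
#include <cublas_v2.h>

/* Launch `batch` small y_i = A_i * x_i products in one call by treating each
 * x_i as an n x 1 matrix. Aarray/xarray/yarray are device arrays of device
 * pointers; all blocks share the same m x n size here (an assumption --
 * H-matrix leaves vary in size and need a variable-size batched routine). */
void batched_gemv(cublasHandle_t handle, int m, int n,
                  const double **Aarray, const double **xarray,
                  double **yarray, int batch)
{
    const double one = 1.0, zero = 0.0;
    /* cuBLAS is column-major: C_i (m x 1) = A_i (m x n) * B_i (n x 1) */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, 1, n,
                       &one,  Aarray, m,
                              xarray, n,
                       &zero, yarray, m,
                       batch);
}
```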
