Supercomputing Frontiers (conference proceedings)

Optimization of Hierarchical Matrix Computation on GPU



Abstract

The demand for dense matrix computation in large-scale and complex simulations is increasing; however, the memory capacity of current computer systems is insufficient for such simulations. The hierarchical matrix method (H-matrices) is attracting attention as a technique that reduces the memory requirements of dense matrix computations. However, computing with H-matrices is more complex than computing with dense or sparse matrices, so accelerating H-matrix computation is necessary. We focus on H-matrix-vector multiplication (HMVM) on a single NVIDIA Tesla P100 GPU. We implement five GPU kernels and compare their execution times with OpenMP implementations on various processors (Broadwell-EP, Skylake-SP, and Knights Landing). The results show that, although HMVM decomposes into many small GEMV operations, merging them into a single GPU kernel was the most effective implementation. Moreover, the performance of BATCHED BLAS in the MAGMA library was comparable to that of the manually tuned GPU kernel.
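
To make the structure the abstract refers to concrete, the sketch below shows why HMVM reduces to many small GEMV operations: each leaf block of an H-matrix is either a small dense block or a low-rank factorization A ≈ U·Vᵀ, and multiplying a vector by it costs one or two small matrix-vector products. This is a minimal CPU-side illustration under an assumed leaf-block layout, with hypothetical names (HLeaf, leaf_mv, hmvm), not the paper's implementation; the question the paper studies is how to schedule these many independent small GEMVs efficiently on the GPU.

```c
#include <stddef.h>

/* Hypothetical leaf-block representation of an H-matrix (illustration only).
 * A dense leaf stores the m x n block A explicitly; a low-rank leaf stores
 * the factors U (m x k) and V (n x k) of the approximation A ~= U * V^T. */
typedef struct {
    int is_lowrank;      /* 0: dense block, 1: low-rank block      */
    size_t row0, col0;   /* offset of the block in the full matrix */
    size_t m, n, k;      /* block size and (for low-rank) the rank */
    const double *A;     /* dense:    m x n, row-major             */
    const double *U;     /* low-rank: m x k, row-major             */
    const double *V;     /* low-rank: n x k, row-major             */
} HLeaf;

/* y_block += A_block * x_block for one leaf: a dense leaf is one small GEMV,
 * a low-rank leaf is two small GEMVs (t = V^T x, then y += U t). */
static void leaf_mv(const HLeaf *b, const double *x, double *y, double *t)
{
    const double *xb = x + b->col0;
    double *yb = y + b->row0;

    if (!b->is_lowrank) {
        for (size_t i = 0; i < b->m; i++)
            for (size_t j = 0; j < b->n; j++)
                yb[i] += b->A[i * b->n + j] * xb[j];
    } else {
        for (size_t r = 0; r < b->k; r++) {        /* t = V^T * x_block */
            t[r] = 0.0;
            for (size_t j = 0; j < b->n; j++)
                t[r] += b->V[j * b->k + r] * xb[j];
        }
        for (size_t i = 0; i < b->m; i++)          /* y_block += U * t  */
            for (size_t r = 0; r < b->k; r++)
                yb[i] += b->U[i * b->k + r] * t[r];
    }
}

/* HMVM: loop over all leaves; `work` must hold at least the maximum rank.
 * On a GPU, these independent small GEMVs are the natural unit to batch or,
 * as the paper finds most effective, to merge into a single kernel. */
void hmvm(const HLeaf *leaves, size_t nleaves,
          const double *x, double *y, double *work)
{
    for (size_t b = 0; b < nleaves; b++)
        leaf_mv(&leaves[b], x, y, work);
}
```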
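
The abstract also reports that MAGMA's BATCHED BLAS was competitive with the hand-tuned kernel. As an illustration of the batching idea only, the sketch below launches a batch of small matrix-vector products through cuBLAS's cublasDgemmBatched, treating each vector as an n x 1 matrix. The fixed block size here is an assumption made for brevity; real H-matrix leaves vary in size, which is where variable-size batched routines such as those in MAGMA come in.

```c
#include <cublas_v2.h>

/* Launch `batch` small y_i = A_i * x_i products in one call by treating each
 * x_i as an n x 1 matrix. Aarray/xarray/yarray are device arrays of device
 * pointers; all blocks share the same m x n size here (an assumption --
 * H-matrix leaves vary in size and need a variable-size batched routine). */
void batched_gemv(cublasHandle_t handle, int m, int n,
                  const double **Aarray, const double **xarray,
                  double **yarray, int batch)
{
    const double one = 1.0, zero = 0.0;
    /* cuBLAS is column-major: C_i (m x 1) = A_i (m x n) * B_i (n x 1) */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, 1, n,
                       &one,  Aarray, m,
                              xarray, n,
                       &zero, yarray, m,
                       batch);
}
```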
