
Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU



Abstract

The use of GPUs has proven very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs, as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve better performance than CPUs on compute-bound operations, especially those characterized by a regular data access pattern. This paper presents a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device architecture (e.g., its high memory throughput). In recent work (Abdelfattah et al., VECPAR 2012), this methodology outperformed existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), a kernel characterized by an irregular data access pattern. We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, which makes it particularly valuable in the Hessenberg and bidiagonal reductions of general matrices (radar applications), the first steps toward computing eigenvalues and singular values, respectively. For small and medium matrices (<4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% improvement in double precision (DP) over existing open-source and commercial software solutions. These gains carry over to the reduction algorithms for both small and large matrices: the improved GEMV yields an average 30% (SP) and 15% (DP) improvement in the Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement in the bidiagonal reduction over the implementation provided by CUBLAS 5.0.
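To make the memory-bound nature of GEMV concrete, the sketch below shows a minimal, unoptimized single-precision GEMV kernel (y = alpha*A*x + beta*y, with A stored column-major as in BLAS) using one thread per output row. This is only an illustrative baseline under assumed naming and launch parameters, not the optimized implementation described in the paper: each element of A is loaded once and used in a single multiply-add, so the kernel is dominated by memory traffic rather than arithmetic.

#include <cuda_runtime.h>

// Baseline SGEMV: y = alpha*A*x + beta*y, A is m x n, column-major with
// leading dimension lda. One thread computes one row of the result.
// Illustrative sketch only; kernel name and block size are assumptions.
__global__ void sgemv_naive(int m, int n, float alpha,
                            const float *A, int lda,
                            const float *x, float beta, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float sum = 0.0f;
    // Consecutive threads in a warp read consecutive rows of each column,
    // so loads from A are coalesced for column-major storage.
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * lda] * x[col];

    y[row] = alpha * sum + beta * y[row];
}

// Illustrative launch: one 128-thread block per 128 rows.
// sgemv_naive<<<(m + 127) / 128, 128>>>(m, n, alpha, dA, lda, dx, beta, dy);

Each fused multiply-add in the loop requires a 4-byte load from A, giving an arithmetic intensity of roughly 0.25 flop/byte, which is why GEMV-class kernels are bounded by memory throughput rather than compute throughput.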
