
Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU



Abstract

The use of GPUs has proven very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs, as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve better performance than CPUs on compute-bound operations, especially those characterized by a regular data access pattern. This paper presents a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device architecture (e.g., its high memory throughput). In recent work (Abdelfattah et al., VECPAR 2012), this methodology outperformed existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), a kernel characterized by an irregular data access pattern. We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, which makes it particularly valuable in the Hessenberg and bidiagonal reductions of general matrices (radar applications), the first steps toward computing eigenvalues and singular values, respectively. For small and medium matrices (<4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% improvement in double precision (DP) over existing open-source and commercial software solutions. These gains carry over to the reduction algorithms for both small and large matrices: the improved GEMV yields an average 30% (SP) and 15% (DP) improvement in the Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement in the bidiagonal reduction over the implementation provided by CUBLAS 5.0.
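To make the memory-bound nature of GEMV concrete, the sketch below shows a minimal, unoptimized single-precision GEMV kernel (y = alpha*A*x + beta*y, with A stored column-major as in BLAS) using one thread per output row. This is only an illustrative baseline under assumed naming and launch parameters, not the optimized implementation described in the paper: each element of A is loaded once and used in a single multiply-add, so the kernel is dominated by memory traffic rather than arithmetic.

#include <cuda_runtime.h>

// Baseline SGEMV: y = alpha*A*x + beta*y, A is m x n, column-major with
// leading dimension lda. One thread computes one row of the result.
// Illustrative sketch only; kernel name and block size are assumptions.
__global__ void sgemv_naive(int m, int n, float alpha,
                            const float *A, int lda,
                            const float *x, float beta, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float sum = 0.0f;
    // Consecutive threads in a warp read consecutive rows of each column,
    // so loads from A are coalesced for column-major storage.
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * lda] * x[col];

    y[row] = alpha * sum + beta * y[row];
}

// Illustrative launch: one 128-thread block per 128 rows.
// sgemv_naive<<<(m + 127) / 128, 128>>>(m, n, alpha, dA, lda, dx, beta, dy);

Each fused multiply-add in the loop requires a 4-byte load from A, giving an arithmetic intensity of roughly 0.25 flop/byte, which is why GEMV-class kernels are bounded by memory throughput rather than compute throughput.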
