International Journal of High Performance Computing Applications

Batched matrix computations on hardware accelerators based on GPUs


Abstract

Scientific applications require solvers that work on many small problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, so there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which are currently four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present, together with their implementations, are by design inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms, which work under drastically different assumptions about hardware design and about the efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs at the problem sizes of interest in the application use cases. The paradigm in which a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to a two-fold speedup and three-fold better energy efficiency compared with our highly optimized batched CPU implementations based on the MKL library.
The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve up to a 2.5x speedup on the NVIDIA K40 GPU.
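The core idea in the abstract is a batched interface: many small, independent factorizations are grouped into a single call so the hardware can be saturated without per-problem launch overhead. The following is a minimal CPU-side sketch of that interface concept using NumPy's stacked-matrix support for Cholesky; it is an illustration only, not the authors' GPU implementation or the CUBLAS batched API.

```python
import numpy as np

def batched_cholesky(batch):
    """Factor a stack of small SPD matrices, shape (n_batch, n, n), in one call.

    np.linalg.cholesky broadcasts over leading dimensions, so every matrix
    in the batch is factored without a Python-level loop over problems --
    the same calling convention a batched BLAS routine exposes.
    """
    return np.linalg.cholesky(batch)

# Build a batch of 1000 random 8x8 symmetric positive-definite matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 8, 8))
spd = a @ a.transpose(0, 2, 1) + 8.0 * np.eye(8)

l = batched_cholesky(spd)
# Each factor is lower triangular and L @ L^T reconstructs its input matrix.
assert np.allclose(l @ l.transpose(0, 2, 1), spd)
```

The design point mirrors the paper's argument: one call over the whole batch amortizes fixed costs across all the small problems, which is what makes the approach throughput-friendly on a GPU.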
