International Journal of High Performance Computing Applications

Batched matrix computations on hardware accelerators based on GPUs


Abstract

Scientific applications require solvers that work on many small problems that are independent of each other. At the same time, high-end hardware evolves rapidly and becomes ever more throughput-oriented, so there is an increasing need for an effective approach to developing energy-efficient, high-performance codes for these small matrix problems, which we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which are currently four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common one-sided factorizations, Cholesky, LU, and QR, for a set of small dense matrices. The algorithms we present, together with their implementations, are by design inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms, which work under drastically different assumptions about hardware design and about the efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs at the problem sizes of interest in the application use cases. The paradigm in which a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient in our applications' context. We illustrate all of these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to a two-fold speedup and three-fold better energy efficiency compared with our highly optimized batched CPU implementations based on the MKL library.
The tested system featured two sockets of Intel Sandy Bridge CPUs. Compared with the batched LU factorization featured in the CUBLAS library for GPUs, we achieve up to a 2.5x speedup on the NVIDIA K40 GPU.
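The core idea in the abstract is a batched interface: many small, independent factorizations are grouped into a single call so the hardware can be saturated without per-problem launch overhead. The following is a minimal CPU-side sketch of that interface concept using NumPy's stacked-matrix support for Cholesky; it is an illustration only, not the authors' GPU implementation or the CUBLAS batched API.

```python
import numpy as np

def batched_cholesky(batch):
    """Factor a stack of small SPD matrices, shape (n_batch, n, n), in one call.

    np.linalg.cholesky broadcasts over leading dimensions, so every matrix
    in the batch is factored without a Python-level loop over problems --
    the same calling convention a batched BLAS routine exposes.
    """
    return np.linalg.cholesky(batch)

# Build a batch of 1000 random 8x8 symmetric positive-definite matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 8, 8))
spd = a @ a.transpose(0, 2, 1) + 8.0 * np.eye(8)

l = batched_cholesky(spd)
# Each factor is lower triangular and L @ L^T reconstructs its input matrix.
assert np.allclose(l @ l.transpose(0, 2, 1), spd)
```

The design point mirrors the paper's argument: one call over the whole batch amortizes fixed costs across all the small problems, which is what makes the approach throughput-friendly on a GPU.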
