QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

Abstract

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each intended to be executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact of these GPU kernels on performance. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, based on static and dynamic scheduling respectively. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We highlight, however, the lack of portability of this approach and its limitation to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using linear programming.
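
The first step described in the abstract, expressing the factorization as a sequence of tile-level tasks, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a square matrix split into b-by-b tiles, borrows the kernel names GEQRT, UNMQR, TSQRT and TSMQR from the usual PLASMA naming convention (the abstract does not name the kernels), and forms the orthogonal factors explicitly with `numpy.linalg.qr` instead of the compact Householder representation that real CPU or GPU kernels would use. The function name `tile_qr` and the task tuples are hypothetical.

```python
import numpy as np


def tile_qr(A, b):
    """Sequential tile QR sketch. Returns (R, tasks).

    R is the upper-triangular factor; tasks is the ordered list of
    (kernel_name, tile_row, tile_col) that a scheduler could dispatch.
    """
    n = A.shape[0]
    assert A.shape == (n, n) and n % b == 0, "square matrix, tile size divides n"
    p = n // b                      # number of tile rows and tile columns
    R = A.astype(float).copy()
    tasks = []

    def tile(i, j):
        # View on tile (i, j); writing into it updates R in place.
        return R[i * b:(i + 1) * b, j * b:(j + 1) * b]

    for k in range(p):
        # GEQRT: factor the diagonal tile.
        Q, Rkk = np.linalg.qr(tile(k, k), mode="complete")
        tile(k, k)[:] = Rkk
        tasks.append(("GEQRT", k, k))

        # UNMQR: apply Q^T to the tiles to the right of the diagonal tile.
        for j in range(k + 1, p):
            tile(k, j)[:] = Q.T @ tile(k, j)
            tasks.append(("UNMQR", k, j))

        for i in range(k + 1, p):
            # TSQRT: factor the triangular diagonal tile stacked on tile (i, k).
            stacked = np.vstack([tile(k, k), tile(i, k)])
            Q2, R2 = np.linalg.qr(stacked, mode="complete")
            tile(k, k)[:] = R2[:b]
            tile(i, k)[:] = 0.0
            tasks.append(("TSQRT", i, k))

            # TSMQR: apply Q2^T to the matching pairs of trailing tiles.
            for j in range(k + 1, p):
                pair = Q2.T @ np.vstack([tile(k, j), tile(i, j)])
                tile(k, j)[:] = pair[:b]
                tile(i, j)[:] = pair[b:]
                tasks.append(("TSMQR", i, j))

    return R, tasks
```

Calling `tile_qr(A, 2)` on a 6-by-6 matrix yields 14 tasks (3 GEQRT, 3 UNMQR, 3 TSQRT, 5 TSMQR). In a static or dynamic scheduler, each tuple would become a unit of work assigned to a CPU core or a GPU, with dependencies implied by the tiles each kernel reads and writes.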

Bibliographic record

Source
2011 25th IEEE International Parallel & Distributed Processing Symposium | 2011 | pp. 932-943 | 12 pages
Authors
Agullo Emmanuel; Augonnet Cedric; Dongarra Jack; Faverge Mathieu; Ltaief Hatem; Thibault Samuel; Tomov Stanimire
Original format: PDF
Chinese Library Classification: TP311.133

Similar literature

1. Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs [J]. Yamazaki Ichitaro, Tomov Stanimire, Dongarra Jack. SIAM Journal on Scientific Computing. 2015, No. 3
2. One-sided Dense Matrix Factorizations on a Multicore with Multiple GPU Accelerators [J]. Ichitaro Yamazaki, Stanimire Tomov, Jack Dongarra. Procedia Computer Science. 2012, No. 1
3. New Algorithm for Tensor Contractions on Multi-Core CPUs, GPUs, and Accelerators Enables CCSD and EOM-CCSD Calculations with over 1000 Basis Functions on a Single Compute Node [J]. Kaliman Ilya A., Krylov Anna I. Journal of Computational Chemistry: Organic, Inorganic, Physical, Biological. 2017, No. 11a12
4. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators [C]. Emmanuel Agullo, Cedric Augonnet, Jack Dongarra. IEEE International Parallel and Distributed Processing Symposium. 2011
5. On implementation and optimization of large-data scientific kernels on multicore processors and GPUs [D]. Hakeem, Mohammad Umar. 2013
6. Evolutionary profiles from the QR factorization of multiple sequence alignments [O]. Anurag Sethi, Patrick O'Donoghue, Zaida Luthey-Schulten. 2005
7. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators [O]. Agullo, Emmanuel; Augonnet, Cédric; Dongarra, Jack. 2011
