Conference: 2011 25th IEEE International Parallel & Distributed Processing Symposium

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

Abstract

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node to their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each intended to run on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact of these GPU kernels on performance. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, based respectively on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitation to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
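
The first step of the abstract, expressing the factorization as a sequence of tasks, can be illustrated with a minimal sketch. It assumes the tile QR algorithm commonly used on homogeneous multicore machines, with four kernel names (GEQRT, ORMQR, TSQRT, TSMQR) taken from that literature; the abstract itself does not name the algorithm or its kernels, so both are assumptions, and the function `tile_qr_tasks` is hypothetical. The code only enumerates tasks in a valid sequential order and performs no numerical work.

```python
from collections import Counter


def tile_qr_tasks(p, q):
    """List the tasks of a tile QR factorization on a p x q grid of tiles.

    Kernel names are assumptions borrowed from the tile-algorithm
    literature; they are labels only, nothing is computed here.
    """
    tasks = []
    for k in range(min(p, q)):
        # Factor the diagonal tile of panel k.
        tasks.append(("GEQRT", (k, k)))
        # Apply that factorization to the tiles to the right in row k.
        for j in range(k + 1, q):
            tasks.append(("ORMQR", (k, j)))
        for i in range(k + 1, p):
            # Eliminate tile (i, k) against the diagonal tile (k, k).
            tasks.append(("TSQRT", (i, k)))
            # Update the tile pair (k, j), (i, j) in each trailing column.
            for j in range(k + 1, q):
                tasks.append(("TSMQR", (i, j, k)))
    return tasks


if __name__ == "__main__":
    # Count how many tasks of each kind a 3 x 3 tile grid produces.
    counts = Counter(kernel for kernel, _ in tile_qr_tasks(3, 3))
    for kernel, n in sorted(counts.items()):
        print(kernel, n)
```

In a scheme like the one the abstract describes, each such task would then be mapped to a CPU core or a GPU, either by a fixed static schedule or by a runtime system; the sketch stops short of that step.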
