Hierarchical QR factorization algorithms for multi-core clusters

Abstract

This paper describes a new QR factorization algorithm designed specifically for massively parallel platforms that combine parallel distributed nodes, where each node is a multi-core processor. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism). Each tile algorithm is uniquely characterized by its sequence of reduction trees. In the context of a cluster of nodes, in order to minimize the number of inter-processor communications (aka "communication-avoiding"), it is natural to consider hierarchical trees composed of an "inter-node" tree which acts on top of "intra-node" trees. At the intra-node level, we propose a hierarchical tree made of three levels: (0) "TS level" for cache-friendliness, (1) "low level" for decoupled highly parallel inter-node reductions, (2) "domino level" to efficiently resolve interactions between local reductions and global reductions. Our hierarchical algorithm and its implementation are flexible and modular, and can accommodate several kernel types, different distribution layouts, and a variety of reduction trees at all levels, both inter-node and intra-node. Numerical experiments on a cluster of multi-core nodes (i) confirm that each of the four levels of our hierarchical tree contributes to performance and (ii) provide insight into how these levels influence performance and interact with each other. Our implementation of the new algorithm with the DAGuE scheduling tool significantly outperforms currently available QR factorization software for all matrix shapes, thereby bringing a new advance in numerical linear algebra for petascale and exascale platforms.
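As a rough illustration of the idea that a hierarchical reduction tree lowers inter-node communication, the sketch below builds the elimination list for a single panel of tile rows. It is not the paper's algorithm: it assumes a 1D block-cyclic layout (tile row `i` lives on node `i % num_nodes`), collapses the three intra-node levels into one flat tree, and uses a binary tree between nodes. All function names are hypothetical.

```python
def flat_tree(rows):
    """Flat tree: the first row eliminates every other row in turn."""
    pivot = rows[0]
    return [(pivot, r) for r in rows[1:]]


def binary_tree(rows):
    """Binary tree: surviving rows are paired off level by level."""
    elims = []
    alive = list(rows)
    while len(alive) > 1:
        survivors = []
        for i in range(0, len(alive), 2):
            survivors.append(alive[i])
            if i + 1 < len(alive):
                elims.append((alive[i], alive[i + 1]))
        alive = survivors
    return elims


def hierarchical_eliminations(num_tile_rows, num_nodes):
    """Return (intra_node, inter_node) lists of (killer, victim) tile rows.

    Assumption: tile row i is stored on node i % num_nodes.
    """
    local = [list(range(n, num_tile_rows, num_nodes)) for n in range(num_nodes)]
    local = [rows for rows in local if rows]
    # Each node first reduces its own rows without any communication.
    intra = [e for rows in local for e in flat_tree(rows)]
    # Only the one surviving row per node takes part in the inter-node tree.
    inter = binary_tree([rows[0] for rows in local])
    return intra, inter


def count_internode(elims, num_nodes):
    """Count eliminations whose two tile rows live on different nodes."""
    return sum(1 for k, v in elims if k % num_nodes != v % num_nodes)


if __name__ == "__main__":
    intra, inter = hierarchical_eliminations(8, 4)
    print("intra-node:", intra)
    print("inter-node:", inter)
    print("cross-node eliminations, hierarchical:",
          count_internode(intra + inter, 4))
    print("cross-node eliminations, distribution-oblivious flat tree:",
          count_internode(flat_tree(list(range(8))), 4))
```

In this toy setting with 8 tile rows on 4 nodes, the hierarchical tree needs one cross-node elimination per node beyond the first, whereas a flat tree that ignores the layout crosses node boundaries for most of its eliminations.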

Bibliographic record

  • Source
    Parallel Computing | 2013, Issue 5 | pp. 212-232 | 21 pages
  • Author affiliations

    University of Tennessee Knoxville, 1122 Volunteer Blvd., Knoxville, TN 37996, USA; Oak Ridge National Laboratory, 1 Bethel Valley Rd., Oak Ridge, TN 37831, USA; Manchester University, School of Computer Science, Manchester, M13 9PL, United Kingdom

    University of Tennessee Knoxville, 1122 Volunteer Blvd., Knoxville, TN 37996, USA

    University of Tennessee Knoxville, 1122 Volunteer Blvd., Knoxville, TN 37996, USA

    INRIA Saclay, Campus de l'École Polytechnique, 91120 Palaiseau, France

    University of Colorado Denver, PO Box 173364, Denver, CO 80217-3364, USA

    University of Tennessee Knoxville, 1122 Volunteer Blvd., Knoxville, TN 37996, USA; École Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France

  • Indexing: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords

    QR factorization; numerical linear algebra; hierarchical architecture; distributed memory; cluster; multi-core;
