首页> 外文期刊>Journal of Parallel and Distributed Computing >A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
【24h】

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

机译:异构系统上用于稀疏LU分解的避免通信的3D算法

获取原文
获取原文并翻译 | 示例
           

摘要

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of 0 (root log n) and latency by a factor of 0 (log n). For non planar cases, our algorithm can reduce the per-process communication volume by 3x and latency by o (n(1/3)) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuPERLU_DIST. Our new 3D code achieves empirical speedups up to 27x for planar graphs and up to 3.3x for non-planar graphs over the baseline 2D SuPERLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (HALO) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24x for planar graphs and 3.5x for non-planar graphs over the baseline 2D SuPERLU_DIST with co-processor acceleration. (C) 2019 Elsevier Inc. All rights reserved.
机译:我们提出了一种新算法,以提高分布式存储系统上右稀疏LU分解的强大可伸缩性。我们用于稀疏LU的3D算法使用三维MPI过程网格,利用消除树并行性,并权衡增加的内存以减少每进程通信。我们还分析了平面图(例如,源自2D网格或网格离散化的那些)和某些非平面图(特别是3D网格和网格)的渐近改进。对于具有n个顶点的平面图,我们的算法将n的渐近通信量渐近地减少了0(根log n),将等待时间减少了0(log n)。对于非平面情况,我们的算法可以将每进程的通信量减少3倍,并将延迟减少o(n(1/3))倍。在所有情况下,实现这些增益所需的存储器都是一个恒定因素。我们通过扩展SuPERLU_DIST中使用的2D数据结构来实现算法。当在Cray XC30的24,000个内核上运行时,我们的新3D代码在基线2D SuPERLU_DIST上实现的平面图经验加速高达27倍,非平面图经验加速高达3.3倍。通过添加用于协处理器卸载的高度异步延迟卸载(HALO)算法,我们扩展了异构体系结构的3D算法[44]。在具有32768个CPU内核和4096个Nvidia K20x GPU的Cray XK7的4096个节点上,该3D算法在具有协处理器加速的基线2D SuPERLU_DIST上,对平面图形的经验性加速高达24倍,对于非平面图形的经验性加速高达3.5倍。 (C)2019 Elsevier Inc.保留所有权利。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号