首页> 外文期刊>Journal of Parallel and Distributed Computing >A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
【24h】

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

机译:异构系统稀疏LU分解的通信3D算法

获取原文
获取原文并翻译 | 示例

摘要

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of 0 (root log n) and latency by a factor of 0 (log n). For non planar cases, our algorithm can reduce the per-process communication volume by 3x and latency by o (n(1/3)) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuPERLU_DIST. Our new 3D code achieves empirical speedups up to 27x for planar graphs and up to 3.3x for non-planar graphs over the baseline 2D SuPERLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (HALO) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24x for planar graphs and 3.5x for non-planar graphs over the baseline 2D SuPERLU_DIST with co-processor acceleration. (C) 2019 Elsevier Inc. All rights reserved.
机译:我们提出了一种新的算法来提高分布式存储系统右侧稀疏LU分解的强大可扩展性。我们的3D稀疏LU算法使用三维MPI进程网格,利用消除树并行性,并为每次流程通信减少的内存进行交易。我们还分析了平面图的渐近改善(例如,由2D电网或网格离散化引起的那些)和某些非平面图(专门用于3D网格和网格)。对于具有N个顶点的平面图,我们的算法将N值为0(根日志n)和延迟的倍数为0(log n)。对于非平面案例,我们的算法可以通过o(n(1/3))次数通过3倍和延迟来减少每次过程通信量。在所有情况下,实现这些收益所需的记忆是一个恒定因素。我们通过扩展superlu_dist中使用的2d数据结构来实现算法。我们的新3D代码在基线2D Superlu_Dist上实现了高达27倍的实证加速,最高可达27倍,对于基线2D Superlu_Dist,在CRAY XC30的24,000核上运行时,对于基线2D Superlu_Dist的非平面图。通过添加用于协处理器卸载的高异步懒惰卸载(HALO)算法来扩展异构架构的3D算法[44]。在CRAY XK7的4096个节点上,具有32,768个CPU核心和4096个NVIDIA K20x GPU,3D算法在基线2D SuperLu_Dist上实现了高达24倍的实证加速,最高可达24倍,用于基线2D Superlu_Dist的非平面图,具有共处理器加速度。 (c)2019 Elsevier Inc.保留所有权利。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号