A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

Sao Piyush; Li Xiaoye S.; Vuduc Richard

首页> 外文期刊>Journal of Parallel and Distributed Computing >A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

【24h】

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

机译：异构系统上用于稀疏LU分解的避免通信的3D算法

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of 0 (root log n) and latency by a factor of 0 (log n). For non planar cases, our algorithm can reduce the per-process communication volume by 3x and latency by o (n(1/3)) times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuPERLU_DIST. Our new 3D code achieves empirical speedups up to 27x for planar graphs and up to 3.3x for non-planar graphs over the baseline 2D SuPERLU_DIST when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload (HALO) algorithm for co-processor offload [44]. On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24x for planar graphs and 3.5x for non-planar graphs over the baseline 2D SuPERLU_DIST with co-processor acceleration. (C) 2019 Elsevier Inc. All rights reserved.

机译：我们提出了一种新算法，以提高分布式存储系统上右稀疏LU分解的强大可伸缩性。我们用于稀疏LU的3D算法使用三维MPI过程网格，利用消除树并行性，并权衡增加的内存以减少每进程通信。我们还分析了平面图（例如，源自2D网格或网格离散化的那些）和某些非平面图（特别是3D网格和网格）的渐近改进。对于具有n个顶点的平面图，我们的算法将n的渐近通信量渐近地减少了0（根log n），将等待时间减少了0（log n）。对于非平面情况，我们的算法可以将每进程的通信量减少3倍，并将延迟减少o（n（1/3））倍。在所有情况下，实现这些增益所需的存储器都是一个恒定因素。我们通过扩展SuPERLU_DIST中使用的2D数据结构来实现算法。当在Cray XC30的24,000个内核上运行时，我们的新3D代码在基线2D SuPERLU_DIST上实现的平面图经验加速高达27倍，非平面图经验加速高达3.3倍。通过添加用于协处理器卸载的高度异步延迟卸载（HALO）算法，我们扩展了异构体系结构的3D算法[44]。在具有32768个CPU内核和4096个Nvidia K20x GPU的Cray XK7的4096个节点上，该3D算法在具有协处理器加速的基线2D SuPERLU_DIST上，对平面图形的经验性加速高达24倍，对于非平面图形的经验性加速高达3.5倍。（C）2019 Elsevier Inc.保留所有权利。

著录项

来源
《Journal of Parallel and Distributed Computing》 |2019年第9期|218-234|共17页
作者
Sao Piyush; Li Xiaoye S.; Vuduc Richard;
展开▼
作者单位

Oak Ridge Natl Lab, Oak Ridge, TN 37830 USA|Georgia Inst Technol, Atlanta, GA 30332 USA;

Lawrence Berkeley Natl Lab, Berkeley, CA USA;

Georgia Inst Technol, Atlanta, GA 30332 USA;

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类
关键词

相似文献

外文文献
中文文献
专利

1. A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems [J] . Sao Piyush, Li Xiaoye S., Vuduc Richard Journal of Parallel and Distributed Computing . 2019,第Sepa期

机译：异构系统稀疏LU分解的通信3D算法
2. EXACT SOLUTION OF SPARSE LINEAR SYSTEMS VIA LEFT-LOOKING ROUNDOFF-ERROR-FREE LU FACTORIZATION IN TIME PROPORTIONAL TO ARITHMETIC WORK [J] . Lourenco Christopher, Escobedo Adolfo R., Moreno-Centeno Erick, SIAM Journal on Matrix Analysis and Applications . 2019,第2期

机译：通过左眼循环无差错的LU分解的精确解决稀疏线性系统与算术工作成比例
3. IMF: An incomplete multifrontal LU-factorization for element-structured sparse linear systems [J] . Vannieuwenhoven N., Meerbergen K. SIAM Journal on Scientific Computing . 2013,第1期

机译：IMF：元素结构的稀疏线性系统的不完全多面LU分解
4. A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices [C] . Piyush Sao, Xiaoye Sherry Li, Richard Vuduc IEEE International Parallel and Distributed Processing Symposium . 2018

机译：稀疏矩阵的避免通信的3D LU分解算法
5. Performance Optimization for Sparse Matrix Factorization Algorithms on Hybrid Multicore Architectures [D] . Tang, Meng. 2020

机译：混合多核架构上稀疏矩阵分解算法的性能优化
6. Fast and efficient fully 3D PET image reconstruction using sparse system matrix factorization with GPU acceleration [O] . Jian Zhou, Jinyi Qi -1

机译：使用具有GpU加速稀疏系统矩阵分解快速高效的全3D pET图像重建
7. New Scheduling Strategies and Hybrid Programming for a Parallel Right-looking Sparse LU Factorization Algorithm on Multicore Cluster Systems [O] . Ichitaro Yamazaki, Xiaoye S. Li 2014

机译：多核集群系统并行右视稀疏LU分解算法的新调度策略与混合编程

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅