首页> 外文会议>IEEE International Parallel and Distributed Processing Symposium >A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices
【24h】

A Communication-Avoiding 3D LU Factorization Algorithm for Sparse Matrices

机译:稀疏矩阵的避免通信的3D LU分解算法

获取原文

摘要

We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D sparse LU algorithm uses a three-dimensional MPI process grid, aggressively exploits elimination tree parallelism and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., from 2D grid or mesh domains) and certain non-planar graphs (specifically for 3D grids and meshes). For planar graphs with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O{log n} and latency by a factor of O{log n}. For non-planar cases, our algorithm can reduce the per-process communication volume by 3× and latency by O{n^1/3} times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in superLU. Our new 3D code achieves speedups up to 27× for planar graphs and up to 3.3× for non-planar graphs over the baseline 2D superLU when run on 24,000 cores of a Cray XC30.
机译:我们提出了一种新的算法,以提高分布式存储系统上右稀疏LU分解的强大可伸缩性。我们的3D稀疏LU算法使用三维MPI过程网格,积极利用消除树并行性并权衡增加的内存以减少每进程的通信。我们还分析了平面图(例如来自2D网格或网格域)和某些非平面图(特别是3D网格和网格)的渐近改进。对于具有n个顶点的平面图,我们的算法将n的通信量渐近地减少了O {log n},而等待时间则减少了O {log n}。对于非平面情况,我们的算法可以将每进程的通信量减少3倍,而延迟则减少O {n ^ 1/3}倍。在所有情况下,实现这些增益所需的存储器都是一个恒定因素。我们通过扩展superLU中使用的2D数据结构来实现我们的算法。当在Cray XC30的24,000个内核上运行时,我们的新3D代码在基准2D superLU上可将平面图的速度提高27倍,将非平面图的速度提高3.3倍。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号