Sparse LU factorization with partial pivoting is important to many scientific applications, but parallelizing it effectively remains an open problem. The main difficulty is that partial pivoting makes the structures of the L and U factors unpredictable in advance. This paper presents a novel approach called S* for parallelizing this factorization on distributed memory machines. S* incorporates static symbolic factorization to avoid run-time control overhead and uses nonsymmetric L/U supernode partitioning and amalgamation strategies to maximize the use of BLAS-3 routines. The irregular task parallelism embedded in sparse LU is exploited using graph scheduling and efficient run-time support techniques that optimize communication, overlap computation with communication, and balance processor loads. Experimental results on the Cray-T3D with a set of Harwell-Boeing nonsymmetric matrices are very encouraging, and good scalability has been achieved. Even compared to a highly optimized sequential code, the parallel speedups are still impressive considering the current status of sparse LU research.
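To make the difficulty named in the abstract concrete, here is a minimal sketch (not the S* algorithm) of dense LU with row partial pivoting applied to two small matrices that share one sparsity pattern but differ in values. The function names `lu_partial_pivot` and `pattern` and the matrices `a1` and `a2` are illustrative assumptions, not taken from the paper. Because the pivot row is chosen by magnitude at run time, the two inputs end up with different row permutations and different nonzero structures in the factors.

```python
def lu_partial_pivot(a):
    """Dense LU with row partial pivoting (illustrative, not S*).

    Returns (perm, lu): perm[k] is the original row placed at position k,
    and lu holds the multipliers of L below the diagonal and U on and above it.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest magnitude in column k (value dependent).
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            lu[i][k] = m
            if m != 0.0:
                # Row update; this is where fill-in can appear.
                for j in range(k + 1, n):
                    lu[i][j] -= m * lu[k][j]
    return perm, lu


def pattern(m):
    """Set of (row, column) positions holding a nonzero entry."""
    return {(i, j) for i, row in enumerate(m) for j, x in enumerate(row) if x != 0.0}


# Two hypothetical matrices with the same nonzero positions, different values.
a1 = [[4, 0, 1],
      [1, 3, 0],
      [0, 1, 2]]
a2 = [[1, 0, 1],
      [4, 3, 0],
      [0, 1, 2]]

perm1, lu1 = lu_partial_pivot(a1)
perm2, lu2 = lu_partial_pivot(a2)

print("same input pattern:", pattern(a1) == pattern(a2))
print("row order 1:", perm1, "factor pattern:", sorted(pattern(lu1)))
print("row order 2:", perm2, "factor pattern:", sorted(pattern(lu2)))
```

The sketch uses dense storage purely for readability; a sparse code would store only the nonzeros, which is why it needs to know, or safely bound, the factor structure before the numerical phase begins.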
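Efficient sparse LU factorization with partial pivoting on distributed memory architectures
Row merge tree for sparse LU factorization
Low-rank approximation of a sparse matrix based on LU factorization with column and row tournament pivoting
Sparse LU factorization with partial pivoting on distributed memory machines
Gaussian elimination with partial pivoting on distributed memory systems
Sparse distributed memory: understanding the speed and robustness of expert memory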