Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

Dumas Jean-Guillaume; Gautier Thierry; Pernet Clement; Roch Jean-Louis; Sultan Ziad

Abstract

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared-memory architectures. In contrast to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse-grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, which makes coarse-grain block algorithms perform more efficiently than fine-grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse-grain block decomposition. We incrementally build efficient kernels, first for matrix multiplication, then for triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants, either iterative or recursive, and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive-iterative methods for triangular system solving and tile-recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full-rank reference numerical software even in the presence of rank deficiencies. © 2015 Elsevier B.V. All rights reserved.
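
The abstract describes building triangular system solving on top of a matrix multiplication kernel, using coarse-grain blocks so that modular reductions are amortized. The following is a minimal pure-Python sketch of that layering for a lower-triangular solve L X = B over GF(p). It assumes a prime p and is not the authors' implementation.

The recursion splits L into 2x2 blocks and proceeds in three steps:
- solve L11 X1 = B1;
- update B2 ← B2 − L21 X1 with one matrix product, reduced modulo p once per entry;
- solve L22 X2 = B2.

All function names (`matmul_mod`, `sub_mod`, `forward_subst`, `trsm_lower`, `parallel_trsm_lower`) and the `threshold` / `n_blocks` parameters are hypothetical. The thread-pool split over column blocks of B only illustrates coarse-grain task decomposition. Under CPython's global interpreter lock it should not be expected to give a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_mod(A, B, p):
    """Classical product A*B over GF(p) with delayed reduction (hypothetical helper).

    Python integers never overflow, so each dot product is accumulated
    exactly and reduced modulo p only once per output entry.
    """
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) % p for col in cols] for row in A]


def sub_mod(A, B, p):
    """Entry-wise A - B over GF(p)."""
    return [[(a - b) % p for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def forward_subst(L, B, p):
    """Base case: scalar forward substitution for L X = B over GF(p)."""
    n, m = len(L), len(B[0])
    X = [[0] * m for _ in range(n)]
    for i in range(n):
        inv = pow(L[i][i], -1, p)  # modular inverse of the pivot (Python 3.8+)
        for j in range(m):
            s = B[i][j] - sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = (s * inv) % p
    return X


def trsm_lower(L, B, p, threshold=2):
    """Recursively solve L X = B over GF(p); L is lower triangular with nonzero diagonal."""
    n = len(L)
    if n <= threshold:
        return forward_subst(L, B, p)
    h = n // 2
    L11 = [row[:h] for row in L[:h]]
    L21 = [row[:h] for row in L[h:]]
    L22 = [row[h:] for row in L[h:]]
    X1 = trsm_lower(L11, B[:h], p, threshold)        # L11 X1 = B1
    B2 = sub_mod(B[h:], matmul_mod(L21, X1, p), p)   # B2 <- B2 - L21 X1
    X2 = trsm_lower(L22, B2, p, threshold)           # L22 X2 = B2
    return X1 + X2


def parallel_trsm_lower(L, B, p, n_blocks=2):
    """Coarse-grain split: independent column blocks of B solved as separate tasks."""
    m = len(B[0])
    step = -(-m // n_blocks)  # ceiling division
    blocks = [[row[j:j + step] for row in B] for j in range(0, m, step)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda Bj: trsm_lower(L, Bj, p), blocks))
    # Glue the solved column blocks back together, row by row.
    return [[x for part in parts for x in part[i]] for i in range(len(L))]


if __name__ == "__main__":
    L = [[1, 0, 0], [2, 3, 0], [4, 5, 6]]
    B = [[1], [2], [3]]
    print(trsm_lower(L, B, 7))  # [[1], [0], [1]]
```

Splitting on the columns of B keeps the tasks independent, because each column of X depends only on the matching column of B. The paper's hybrid recursive-iterative variants, its splitting strategies and its PLUQ decomposition are not reproduced here.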

Bibliographic details

Source: Parallel Computing, 2016, No. 9, pp. 235-249 (15 pages)
Authors: Dumas Jean-Guillaume; Gautier Thierry; Pernet Clement; Roch Jean-Louis; Sultan Ziad
Affiliations

Univ Grenoble Alpes, Lab Jean Kuntzmann, CNRS, Inria, Grenoble, France;

Univ Grenoble Alpes, Lab Informat Grenoble, CNRS, Inria, Grenoble, France;

Univ Grenoble Alpes, Lab Informat & Parallélisme, Univ Lyon, Inria, Grenoble, France;

Univ Grenoble Alpes, Lab Informat Grenoble, CNRS, Inria, Grenoble, France;

Univ Grenoble Alpes, Lab Jean Kuntzmann, CNRS, Inria, Grenoble, France|Univ Grenoble Alpes, Lab Informat Grenoble, CNRS, Inria, Grenoble, France;

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English
Keywords: PLUQ decomposition; Parallel shared memory computation; Finite field; Dataflow task dependencies; NUMA architecture; Rank deficiency

Similar documents

1. Recursion leads to automatic variable blocking for dense linear-algebra algorithms [J]. IBM Journal of Research and Development, 1997, No. 6.

2. Parallel Architecture for the Solution of Linear Equations Systems Based on Division Free Gaussian Elimination Method Implemented in FPGA [J]. R. Martinez, D. Torres, M. Madrigal. WSEAS Transactions on Circuits and Systems, 2009, No. 10a12.

3. Parallelizing dense and banded linear algebra libraries using SMPSs [J]. Rosa M. Badia, Jose R. Herrero, Jesus Labarta. Concurrency and Computation, 2009, No. 18.

4. Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling [C]. Piotr Luszczek, Jack Dongarra. International Conference on Parallel Processing and Applied Mathematics, 2012.

5. Exact Results Regarding the Physics of Complex Systems via Linear Algebra, Hidden Markov Models, and Information Theory [D]. Riechers, Paul Michael. 2016.

6. The pharmacokinetic modelling of GI198745 (dutasteride), a compound with parallel linear and nonlinear elimination [O]. Per Olsson Gisleskog, David Hermann, Margareta Hammarlund-Udenaes. 1999.

7. Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination [O]. Dumas, Jean-Guillaume; Gautier, Thierry; Pernet, Clément. 2016.

8. Gaussian Elimination for Dense Systems on STAR and a New Parallel Algorithm for Diagonally Dominant Tridiagonal Systems [R]. Thomas L. Jordan. 1975.
