...
首页> 外文期刊>Parallel Computing >Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination
【24h】

Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

机译:基于递归的精确密集线性代数例程的高斯消除并行化

获取原文
获取原文并翻译 | 示例
   

获取外文期刊封面封底 >>

       

摘要

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrarily to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, for matrix multiplication first, then triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive iterative methods for triangular system solve and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32 cores NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full rank reference numerical software even in the presence of rank deficiencies. (C) 2015 Elsevier B.V. All rights reserved.
机译:我们提出了块算法及其在共享存储架构上实现亚三次高斯消除并行化的实现。与并行数值线性代数中的经典三次算法相反,我们在这里集中讨论递归算法和粗粒度并行化。确实,亚三次矩阵算术只能通过递归算法来实现,这使得粗粒块算法比细粒算法更有效。这项工作是由有限域上的密集线性代数的设计和实现所推动的,在该领域中广泛使用快速矩阵乘法,而昂贵的模数归约法也倡导粗粒分解。我们逐步构建高效的内核,首先进行矩阵乘法,然后进行三角系统求解,然后在其上构建递归PLUQ分解算法。我们使用几种算法变体研究这些内核的并行化:迭代或递归,并使用不同的拆分策略。实验表明,矩阵乘法的递归自适应方法,三角系统求解的混合递归迭代方法以及PLUQ分解的平铺递归版本以及各种数据映射策略,在32核NUMA架构上提供了最佳性能。总的来说,我们证明了模块化约简的开销远远超过了快速线性代数算法所能弥补的,并且即使在存在秩不足的情况下,精确的稠密线性代数也能与全秩参考数值软件的性能相匹配。 (C)2015 Elsevier B.V.保留所有权利。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号