首页> 外文期刊>Concurrency and computation: practice and experience >Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
【24h】

Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

机译:使用局部旋转的递归图块LU分解实现数值精度和高性能

获取原文
获取原文并翻译 | 示例

摘要

The LU factorization is an important numerical algorithm for solving systems of linear equations in sciencernand engineering and is a characteristic of many dense linear algebra computations. For example, it hasrnbecome the de facto numerical algorithm implemented within the LINPACK benchmark to rank the mostrnpowerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue tornpresent challenges to the development of fast and robust numerical software due to the increasing levelsrnof hardware parallelism and widening gap between core and memory speeds. In this context, the difficultyrnin developing new algorithms for the scientific community resides in the combination of two goals:rnachieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposesrna new approach for computing the LU factorization in parallel on multicore architectures, which not onlyrnimproves the overall performance but also sustains the numerical quality of the standard LU factorizationrnalgorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive andrnhighly parallel, the inherently problematic portion of the LU factorization is the panel factorization due tornits memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approachrnuses a parallel fine-grained recursive formulation of the panel factorization step and implements the updaternof the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and locklessrnsynchronization mechanisms, our implementation lets the overall computation flow naturally withoutrncontention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneousrngranularities and to transparently introduce algorithmic lookahead. The performance results of our implementationrnare competitive compared to the currently available software packages and libraries. For example,rnit is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster thanrnLAPACK with multithreaded Intel MKL BLAS.
机译:LU分解是科学和工程学中求解线性方程组的重要数值算法,并且是许多密集线性代数计算的特征。例如,它已成为LINPACK基准内实施的事实上的数值算法,以对TOP500网站收集的世界上功能最强大的超级计算机进行排名。由于硬件并行性水平的提高以及内核与内存速度之间差距的不断扩大,多核处理器继续对快速,强大的数字软件的开发提出挑战。在这种情况下,为科学界开发新算法的困难在于两个目标的结合:在保持数值算法准确性的同时实现高性能。本文提出了一种在多核架构上并行计算LU分解的新方法,该方法不仅可以提高整体性能,而且还可以通过部分旋转来维持标准LU分解算法的数值质量。尽管尾部子矩阵的更新需要大量计算并且高度并行,但是LU因式分解的固有问题部分是面板因式分解,这归因于其内存绑定特性以及选择适当枢轴的原子性。我们的方法使用面板分解步骤的并行细粒度递归公式,并使用tile算法实现尾随子矩阵的updaternof。基于数据的无冲突分区和无锁同步机制,我们的实现使整个计算流程自然而无争执。然后,称为QUARK的动态运行时系统能够调度具有异构粒度的任务,并透明地引入算法前瞻。与当前可用的软件包和库相比,我们实施的性能结果具有竞争力。例如,与等效的Intel MKL例程相比,rnit的速度提高了40%,比多线程Intel MKL BLAS的rnLAPACK的速度提高了三倍。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号