Conference: IEEE International Parallel and Distributed Processing Symposium

Communication-Avoiding Cholesky-QR2 for Rectangular Matrices

Abstract

Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of Θ(P^(1/6)) less interprocessor communication on P processors than any previous parallel QR implementation. Our performance study on Intel Knights-Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
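For readers unfamiliar with the underlying method, the following is a minimal sequential sketch of the standard CholeskyQR2 idea: run a Cholesky-based QR pass twice and multiply the two triangular factors. It is not the 3D processor-grid parallelization the abstract describes, and nothing here is taken from the authors' code. The function names `cholesky_qr` and `cholesky_qr2` are hypothetical, and the sketch assumes NumPy and a tall, full-rank, reasonably well-conditioned input matrix.

```python
import numpy as np


def cholesky_qr(a):
    """One CholeskyQR pass on a tall m x n matrix with full column rank.

    Returns Q (m x n) and upper-triangular R (n x n) with A = Q R.
    Illustrative helper; not from the paper's implementation.
    """
    gram = a.T @ a                    # n x n Gram matrix A^T A
    lower = np.linalg.cholesky(gram)  # gram = lower @ lower.T
    r = lower.T                       # upper-triangular factor R
    # Q = A R^{-1} = (L^{-1} A^T)^T. A general solve is used here for
    # brevity; a dedicated triangular solve would be cheaper.
    q = np.linalg.solve(lower, a.T).T
    return q, r


def cholesky_qr2(a):
    """CholeskyQR2: a second pass repairs the orthogonality lost in the first."""
    q1, r1 = cholesky_qr(a)
    q, r2 = cholesky_qr(q1)
    return q, r2 @ r1                 # A = Q (R2 R1)
```

Calling `cholesky_qr2(a)` on a tall matrix `a` returns `q` with nearly orthonormal columns and an upper-triangular `r` such that `q @ r` reproduces `a` up to rounding. The Gram matrix squares the condition number of `a`, which is why a single pass can lose orthogonality and why the sketch assumes a reasonably well-conditioned input.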
