ACM Transactions on Mathematical Software

A QDWH-based SVD Software Framework on Distributed-memory Manycore Systems

Abstract

This article presents a high-performance software framework for computing a dense singular value decomposition (SVD) on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (2010) and Nakatsukasa and Higham (2013), the SVD solver relies on the polar decomposition computed with the QR-based Dynamically Weighted Halley (QDWH) algorithm. Although the QDWH-based SVD algorithm performs significantly more floating-point operations than the traditional SVD with one-stage bidiagonal reduction, the high level of concurrency inherent in its compute-bound Level 3 BLAS kernels ultimately compensates for the arithmetic complexity overhead. By using the ScaLAPACK two-dimensional block-cyclic data distribution with a rectangular processor topology, rather than relying on the default square processor grid, the resulting QDWH-SVD further reduces communication during the panel factorization while increasing the degree of parallelism during the update of the trailing submatrix. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on performance by examining the trade-offs between communication and computation revealed by profiling. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and the Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3-fold and 8-fold speedups on the Haswell- and KNL-based platforms, respectively, against ScaLAPACK PDGESVD, and proves to be a competitive alternative for both well-conditioned and ill-conditioned matrices. Finally, we propose a performance model based on these empirical results.
Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.
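
To make the abstract's central idea concrete, here is a minimal single-node sketch of an SVD obtained through a QDWH polar decomposition. It is not the framework's distributed implementation: it assumes a square, nonsingular, real matrix, uses NumPy's dense QR and symmetric eigensolver in place of ScaLAPACK kernels, and picks simple norm-based bounds for the scaling factor and the initial lower bound on the smallest singular value. The function names qdwh_polar and qdwh_svd, the tolerance, and the iteration cap are illustrative assumptions.

```python
import numpy as np


def qdwh_polar(A, tol=1e-10, max_iter=50):
    """Polar decomposition A = Up @ H via QR-based dynamically weighted Halley.

    Sketch only: assumes A is square, real, and nonsingular.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    identity = np.eye(n)

    # Scale so that all singular values of X lie in (0, 1].
    # Assumption: the Frobenius norm is used as a cheap upper bound on ||A||_2.
    alpha = np.linalg.norm(A, "fro")
    X = A / alpha

    # Lower bound on sigma_min(X): 1 / ||X^{-1}||_F <= 1 / ||X^{-1}||_2.
    # Assumption: an explicit inverse is acceptable for a small demo matrix.
    l = 1.0 / np.linalg.norm(np.linalg.inv(X), "fro")

    for _ in range(max_iter):
        # Dynamically chosen weights a, b, c for this iteration.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0

        # QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, identity]))
        Q1, Q2 = Q[:n], Q[n:]

        # Inverse-free Halley update expressed through the QR factors.
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)

        # Update the lower bound on the smallest singular value (capped at 1).
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))

        diff = np.linalg.norm(X_new - X, "fro")
        X = X_new
        if diff < tol:
            break

    # H = Up^T A is symmetric positive definite; symmetrize against rounding.
    H = X.T @ A
    H = 0.5 * (H + H.T)
    return X, H


def qdwh_svd(A):
    """SVD A = U @ diag(s) @ Vt from the polar factors of A."""
    Up, H = qdwh_polar(A)
    # Eigendecomposition of the symmetric factor: H = V diag(w) V^T.
    w, V = np.linalg.eigh(H)
    order = np.argsort(w)[::-1]  # singular values in descending order
    s, V = w[order], V[:, order]
    # A = Up H = (Up V) diag(s) V^T.
    return Up @ V, s, V.T
```

Calling qdwh_svd(A) on a small random matrix and comparing s with np.linalg.svd(A, compute_uv=False) is a quick sanity check. Almost all of the work in this sketch sits in QR factorizations and matrix products, which is the Level 3 BLAS-rich profile the abstract credits for offsetting the extra floating-point operations.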
