Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance

Abstract

This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish a theoretical proof of the correctness and numerical stability of the approach in the context of Hessenberg Reduction (HR), and present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
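The checksum idea behind ABFT can be illustrated with a small, single-process sketch. It is not the paper's algorithm: it assumes one extra checksum column holding row sums, applies only a left-side Householder reflector (the two-sided update and the panel checkpoints described in the abstract are not modeled), and simulates a failure by erasing one data column. All function names here are hypothetical and chosen for illustration.

```python
def with_row_checksums(a):
    """Append one column holding the sum of each row: [A, A e]."""
    return [row + [sum(row)] for row in a]


def matmul(x, y):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def householder(v):
    """Reflector H = I - 2 v v^T / (v^T v)."""
    n = len(v)
    vtv = sum(c * c for c in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vtv
             for j in range(n)] for i in range(n)]


def recover_column(a_ext, lost):
    """Rebuild data column `lost` from the checksum column and the survivors."""
    n = len(a_ext[0]) - 1  # index of the checksum column
    return [row[n] - sum(row[j] for j in range(n) if j != lost)
            for row in a_ext]


a = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 1.0],
     [1.0, 2.0, 5.0]]
h = householder([1.0, 2.0, 2.0])

# Left update on the extended matrix: H [A, A e] = [H A, (H A) e],
# so the last column is still a valid row checksum after the update.
updated = matmul(h, with_row_checksums(a))

# Simulate losing the data of column 1 (assumption: one failed owner).
lost = 1
expected = [row[lost] for row in updated]
for row in updated:
    row[lost] = None

# Recover the lost column and write it back.
restored = recover_column(updated, lost)
for row, value in zip(updated, restored):
    row[lost] = value
```

Because the checksum column is updated by the same transformation as the data, no separate checkpoint of the trailing matrix is needed in this sketch; the recovered values match the lost ones up to floating-point rounding.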
