Journal: IEEE Transactions on Parallel and Distributed Systems

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers


Abstract

Current trends and projections show that faults in computer systems are becoming increasingly common. The resulting errors may be detected, and possibly corrected transparently, e.g., by Error Correcting Codes (ECC). To be fault-tolerant, a program must also handle Detected and Uncorrected Errors (DUE), such as when an ECC encounters too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, discards any progress made since then. Alternatively, application-specific techniques exist, such as restarting an iterative program with its latest iteration's values as the initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach can recover any part of three widely used Krylov subspace methods: CG, GMRES, and BiCGStab, along with their preconditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware. We also study the effect of varying the memory page size, to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC).
Compared to checkpointing and to the state-of-the-art algorithmic restart technique, our methods show less overhead at scales from small (8 cores) to large (1024 cores). A trade-off arises between our straightforward and asynchronous approaches, depending on the rate at which faults occur. At the lowest considered fault rate and page size, overlapping recoveries decreases their average cost from 5.40 to 2.24 percent of the ideal fault-free execution time. Our methods generally outperform the state of the art even with the increased overheads of large page sizes, and perform similarly in edge cases. These results also indicate that our techniques become increasingly efficient as the matrix size increases.
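
The abstract does not spell out which relations are used, so the following is only an illustrative sketch of the idea of "recomputing or inverting" a relation. It assumes the standard CG invariant r = b - A x (which holds in exact arithmetic) is one such relation, and that A is symmetric positive definite, so every principal submatrix of A is invertible. The function names, the small dense lists, and the index list `block` (standing in for a lost memory page) are all hypothetical and not taken from the paper.

```python
# Sketch only: dense Python lists stand in for the solver's vectors, and
# `block` lists the indices that were on the lost memory page.

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol

def recover_residual_block(A, b, x, block):
    """Recompute lost entries of r directly from r = b - A x."""
    n = len(x)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in block]

def recover_iterate_block(A, b, r, x, block):
    """Invert r = b - A x for the lost entries of x.

    Restricting the relation to the rows in `block` gives
    A[block, block] x[block] = b[block] - r[block] - A[block, rest] x[rest],
    which only reads entries of x that were not lost.
    """
    lost = set(block)
    rest = [j for j in range(len(b)) if j not in lost]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in rest) for i in block]
    sub = [[A[i][j] for j in block] for i in block]
    return solve_dense(sub, rhs)

# Small symmetric positive definite example (assumed data, for illustration).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.5, -0.25, 1.0, 0.75]
r = [bi - sum(a * v for a, v in zip(row, x)) for row, bi in zip(A, b)]

# Simulate losing the "page" holding x[1] and x[2].
nan = float("nan")
x_damaged = [x[0], nan, nan, x[3]]
restored_x = recover_iterate_block(A, b, r, x_damaged, [1, 2])

# Simulate losing r[0] and r[1] instead; x is intact, so just recompute.
restored_r = recover_residual_block(A, b, x, [0, 1])
```

In this sketch a lost block of r costs a partial matrix-vector product, while a lost block of x costs a small linear solve whose size is the number of lost entries. This is consistent with the abstract's observation that overheads grow with page size, although the actual costs in the paper depend on its implementation.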
