Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

Luc Jaulmes; Miquel Moretó; Eduard Ayguadé; Jesús Labarta; Mateo Valero; Marc Casas


Abstract

Current trends and projections show that faults in computer systems are becoming increasingly common. Such errors may be detected, and possibly corrected transparently, e.g., by Error Correcting Codes (ECC). For a program to be fault-tolerant, it also needs to handle Detected and Uncorrected Errors (DUE), such as when an ECC encounters too many bit flips in a codeword. While correcting an error has an overhead in itself, it can also affect the progress of a program. The most generic technique, rolling back the program state to a previously taken checkpoint, discards any progress made since then. Alternatively, application-specific techniques exist, such as restarting an iterative program with its latest iteration's values as the initial guess. We introduce a novel error correction technique for iterative linear solvers, designed to preserve both the progress made and the solver's future convergence by recovering the program's state exactly. Leveraging the asynchrony of task-based programming models, we mask our technique's overhead by overlapping error correction with the solver's normal workload. Our technique relies on analysing solvers to find redundancy in the form of relations between data. We are then able to restore discarded or corrupted data by recomputing or inverting the appropriate relations. We demonstrate that this approach can recover any part of three widely used Krylov subspace methods, CG, GMRES and BiCGStab, and their preconditioned versions. We implement our technique for CG and recover lost data at the scale of a memory page, which is the granularity at which Operating Systems (OS) report memory errors on commodity hardware. We also study the effect of varying the memory page size, to address non-standard sizes and the possible use of huge pages in High Performance Computing (HPC).
When compared to checkpointing and to the state-of-the-art algorithmic restart technique, at small (8 cores) to large (1024 cores) scale, our methods show less overhead. A trade-off arises between our straightforward and asynchronous approaches, depending on the rate at which faults happen. At the lowest considered fault rate and page size, overlapping recoveries decreases their average cost from 5.40 to 2.24 percent of the ideal faultless execution time. Our methods generally outperform the state of the art even with increased overheads on big page sizes, and perform similarly on edge cases. These results also indicate that our techniques become increasingly efficient as the matrix size increases.
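
The abstract does not list the relations it exploits. As a minimal illustrative sketch, assume the standard CG residual invariant r = b - A x, which holds up to rounding error. A lost block of r can then be recomputed directly, and a lost block I of x can be recovered by inverting the same relation, i.e. solving A[I,I] x_I = b_I - r_I - A[I,J] x_J over the surviving indices J. All function names below are hypothetical and this is not the authors' implementation; it is sequential and ignores the task-based overlapping described above.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_steps(A, b, steps):
    """Run a few plain CG iterations from x = 0; return (x, r)."""
    x = [0.0] * len(b)
    r = list(b)                      # r = b - A x with x = 0
    p = list(r)
    rr = dot(r, r)
    for _ in range(steps):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, r

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(aug[k][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for k in range(c + 1, n):
            f = aug[k][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[k][j] -= f * aug[c][j]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol

def recover_r_block(A, b, x, lost):
    """Recompute lost entries of r from the relation r = b - A x."""
    return {i: b[i] - dot(A[i], x) for i in lost}

def recover_x_block(A, b, x, r, lost):
    """Invert r = b - A x for the lost entries of x (only surviving x is read)."""
    lost = list(lost)
    kept = [j for j in range(len(b)) if j not in lost]
    sub = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - r[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    return dict(zip(lost, solve_dense(sub, rhs)))

# Assumed test problem: 1D Laplacian (symmetric positive definite), 8 unknowns.
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [float(i + 1) for i in range(n)]
x, r = cg_steps(A, b, 3)

lost = [2, 3, 4]                     # stand-in for one corrupted "page" of x
x_damaged = list(x)
for i in lost:
    x_damaged[i] = float("nan")      # the lost values are never read back

x_rec = recover_x_block(A, b, x_damaged, r, lost)
print("max recovery error in x:", max(abs(x_rec[i] - x[i]) for i in lost))
```

Because A[I,I] is a principal submatrix of a symmetric positive definite matrix, it is itself invertible, so the small solve is well defined in this setting. The recovered block matches the pre-fault state up to rounding, which is the sense in which such a recovery preserves the solver's subsequent convergence.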


Bibliographic details

Source
Parallel and Distributed Systems, IEEE Transactions on | 2018, No. 9 | pp. 1961-1974 | 14 pages
Authors
Luc Jaulmes; Miquel Moretó; Eduard Ayguadé; Jesús Labarta; Mateo Valero; Marc Casas;
Author affiliations

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Barcelona Supercomputing Center (BSC), Barcelona, Spain;

Original format: PDF
Language: English
Keywords
Error correction codes; Hardware; Redundancy; Programming; Program processors; Registers;


Similar literature

1. Stopping criteria, forward and backward errors for perturbed asynchronous linear fixed point methods in finite precision [J]. Miellou JC, Spiteri P, El Baz D. IMA Journal of Numerical Analysis. 2005, No. 3
2. An exact method for estimating maximum errors of multi-mode floating-point iterative booth multiplier [J]. Kun-Yi Wu, Shiann-Rong Kuang, Kee-Khuan Yu. International Journal of Computational Science and Engineering. 2013, No. 4
3. Exact and Approximated Outage Probability Analyses for Decode-and-Forward Relaying System Allowing Intra-Link Errors [J]. Zhou X., Cheng M., He X. Wireless Communications, IEEE Transactions on. 2014, No. 12
4. Exploiting asynchrony from exact forward recovery for DUE in iterative solvers [C]. Luc Jaulmes, Marc Casas, Miquel Moretó. International Conference for High Performance Computing, Networking, Storage and Analysis. 2015
5. Resilient Iterative Linear Solvers Running Through Errors [D]. Elliott, James John, III. 2015
6. A 3D Finite-Difference BiCG Iterative Solver with the Fourier-Jacobi Preconditioner for the Anisotropic EIT/EEG Forward Problem [O]. Sergei Turovets, Vasily Volkov, Aleksej Zherdetsky. 2014
7. Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers [O]. Fasi, Massimiliano; Robert, Yves; Uçar, Bora. 2015
