Published in: 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops

Combining Backward and Forward Recovery to Cope with Silent Errors in Iterative Solvers

Abstract

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167-176] has shown how to combine such a verification mechanism (a stability test that checks the orthogonality of two vectors and recomputes the residual) with checkpointing: the idea is to verify every d iterations and to checkpoint every c × d iterations. When the verification mechanism detects a silent error, one can roll back to the last checkpoint and re-execute from there. In this paper, we also propose to combine checkpointing and verification, but we use ABFT rather than stability tests. ABFT can be used for error detection alone, or for both detection and correction; in the latter case it allows forward recovery (with neither rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the Conjugate Gradient algorithm. Finally, we validate our new approach through a set of simulations.
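
The verify-every-d, checkpoint-every-c × d schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cg_with_verification`, the `inject_at` fault-injection parameter, and the tolerance values are all assumptions made for illustration, and the verification step here only recomputes the residual (it omits the orthogonality check and does not use ABFT checksums).

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_with_verification(A, b, d=2, c=2, tol=1e-10, max_iter=200, inject_at=None):
    """Conjugate Gradient with backward recovery (illustrative sketch).

    Verify every d iterations, checkpoint every c*d iterations, and roll
    back to the last checkpoint when verification fails.
    inject_at is a hypothetical knob that corrupts x once, to simulate
    a silent error at the given iteration index.
    Returns (x, iterations, rollbacks).
    """
    x = [0.0] * len(b)
    r = list(b)          # residual for the initial guess x = 0
    p = list(r)
    checkpoint = (0, list(x), list(r), list(p))
    rollbacks = 0
    scale = 1.0 + max(abs(bi) for bi in b)
    i = 0
    while i < max_iter:
        # One standard CG step.
        Ap = matvec(A, p)
        rr = dot(r, r)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if inject_at == i:
            x[0] += 1.0          # simulated silent error in the iterate
            inject_at = None     # strike only once
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        i += 1

        converged = dot(r, r) ** 0.5 <= tol
        if i % d == 0 or converged:
            # Verification: the recurrence residual r must match b - A x.
            true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            drift = max(abs(t - ri) for t, ri in zip(true_r, r))
            if drift > 1e-8 * scale:
                # Silent error detected: backward recovery.
                i, x, r, p = (checkpoint[0], list(checkpoint[1]),
                              list(checkpoint[2]), list(checkpoint[3]))
                rollbacks += 1
                continue
            if converged:
                break
            if i % (c * d) == 0:
                # State is verified, so it is safe to checkpoint.
                checkpoint = (i, list(x), list(r), list(p))
    return x, i, rollbacks


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    print(cg_with_verification(A, b, d=2, c=2, inject_at=1))
```

In this sketch a checkpoint is taken only immediately after a successful verification, so a rollback always restores a verified state. An ABFT-based variant, as the abstract proposes, would replace the residual recomputation with checksum tests and could correct a single error in place instead of rolling back.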
