Conference: International Conference on Parallel and Distributed Computing

Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms

Abstract

In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate both soft errors and process failures while maintaining forward progress. We address the challenge by proposing performance models that help in designing performance-efficient, resilient iterative applications. The models consider the interaction between soft error and process failure resilience solutions. We experimented with a linear solver application and two distinct kinds of soft error detectors: one has high overhead and high accuracy, whereas the other has low overhead and low accuracy. We show how both can be leveraged to verify the integrity of the checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resiliency benefit of invoking the low-overhead detector at high frequency within the checkpoint interval, so that soft error recovery is timely and less work has to be recomputed.
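
The interplay the abstract describes (a cheap detector run often inside the checkpoint interval, plus a costlier check before a checkpoint is committed) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the Jacobi solver, the range-check and residual-based detectors, the in-memory checkpoint, and the names `solve`, `cheap_detector`, `scaled_residual`, `detect_every`, `checkpoint_every`, and `faults` are all hypothetical stand-ins. Process failures are not modeled; only soft-error rollback is shown.

```python
import math

def jacobi_step(A, b, x):
    """One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]

def cheap_detector(x, bound=1e6):
    """Hypothetical low-overhead, low-accuracy detector: an O(n) range check.
    It misses corruptions that stay finite and within the bound."""
    return all(math.isfinite(v) and abs(v) <= bound for v in x)

def scaled_residual(A, b, x):
    """max_i |b[i] - (A x)[i]| / |A[i][i]|, an O(n^2) computation.
    For strictly diagonally dominant A this quantity never grows under
    fault-free Jacobi iterations, so growth is treated as a sign of corruption."""
    n = len(b)
    return max(
        abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) / abs(A[i][i])
        for i in range(n)
    )

def solve(A, b, iters=60, detect_every=2, checkpoint_every=10,
          faults=None, slack=1e-12):
    """Jacobi with in-memory checkpoints and two detectors (illustrative only).
    `faults` maps iteration -> (index, value); each fault is consumed when it
    is injected, so a rollback does not replay it."""
    faults = dict(faults or {})
    x = [0.0] * len(b)
    ckpt_iter, ckpt_x = 0, list(x)
    ckpt_res = scaled_residual(A, b, x)
    rollbacks = []  # (iteration where the error was detected, iteration restored)
    k = 0
    while k < iters:
        x = jacobi_step(A, b, x)
        k += 1
        if k in faults:  # simulated soft error
            idx, val = faults.pop(k)
            x[idx] = val
        ok = True
        if k % detect_every == 0:
            ok = cheap_detector(x)
        if ok and k % checkpoint_every == 0:
            # Costlier check: verify the state before committing it as a checkpoint.
            res = scaled_residual(A, b, x)
            ok = res <= ckpt_res + slack
            if ok:
                ckpt_iter, ckpt_x, ckpt_res = k, list(x), res
        if not ok:
            rollbacks.append((k, ckpt_iter))
            k, x = ckpt_iter, list(ckpt_x)
    return x, rollbacks

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]  # exact solution is [1, 1, 1]
    for faults in (None, {13: (0, 1e9)}, {13: (0, 1.5)}):
        x, rollbacks = solve(A, b, faults=faults)
        print(faults, rollbacks)
```

In this toy run, a large corruption injected at iteration 13 is caught by the cheap check at iteration 14, so only four iterations are redone. A small corruption passes the cheap check and is caught only by the checkpoint verification at iteration 20, so the full ten-iteration interval is recomputed. This mirrors, in miniature, the trade-off between detector frequency and recomputed work that the abstract refers to; it says nothing about the paper's actual detectors or measured results.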
