Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

ACR: Automatic checkpoint/restart for soft and hard error protection

Abstract

As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
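
To make the idea of an adaptive checkpoint period concrete, here is a minimal Python sketch. It is not ACR's implementation: the abstract does not say which formula ACR uses, so the sketch assumes Young's first-order approximation (period = sqrt(2 × checkpoint cost × mean time between failures)) and re-estimates the mean time between failures from the failures observed so far. The names `young_interval`, `AdaptivePeriod`, and `replicas_agree` are hypothetical, and comparing checkpoint hashes across two replicas is likewise an assumed way to flag a silent soft error.

```python
import hashlib
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint period.

    Assumption: this formula stands in for whatever rule ACR actually uses.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class AdaptivePeriod:
    """Re-estimates the checkpoint period from failures observed online."""

    def __init__(self, checkpoint_cost_s: float, initial_mtbf_s: float):
        self.checkpoint_cost_s = checkpoint_cost_s
        self.initial_mtbf_s = initial_mtbf_s
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1

    def period(self, elapsed_s: float) -> float:
        # Until a failure has been observed, fall back to the initial guess.
        if self.failures == 0:
            mtbf_s = self.initial_mtbf_s
        else:
            mtbf_s = elapsed_s / self.failures
        return young_interval(self.checkpoint_cost_s, mtbf_s)


def replicas_agree(checkpoint_a: bytes, checkpoint_b: bytes) -> bool:
    """Returns False when two replica checkpoints differ (possible soft error)."""
    digest_a = hashlib.sha256(checkpoint_a).digest()
    digest_b = hashlib.sha256(checkpoint_b).digest()
    return digest_a == digest_b
```

In this sketch, a higher observed failure rate shortens the estimated mean time between failures and therefore the checkpoint period, which is the qualitative behavior the abstract describes.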