Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

ACR: Automatic checkpoint/restart for soft and hard error protection

Abstract

As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
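The abstract does not say how ACR derives the checkpoint period from the observed failure rate. As a rough illustration of the general idea only, the sketch below assumes Young's first-order approximation (period = sqrt(2 × checkpoint cost × MTBF)) and a simple running MTBF estimate. The class `AdaptiveCheckpointPeriod`, its methods, and all parameter names are hypothetical and are not taken from ACR.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint period.

    Assumption: this formula stands in for whatever model ACR actually uses.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class AdaptiveCheckpointPeriod:
    """Hypothetical helper: re-estimates MTBF online and adapts the period."""

    def __init__(self, checkpoint_cost_s, initial_mtbf_s):
        self.checkpoint_cost_s = checkpoint_cost_s
        self.initial_mtbf_s = initial_mtbf_s  # prior used until a failure is seen
        self.failures = 0
        self.elapsed_s = 0.0

    def record(self, elapsed_s, failures=0):
        """Account for elapsed run time and any failures observed in it."""
        self.elapsed_s += elapsed_s
        self.failures += failures

    def mtbf(self):
        """Observed mean time between failures, or the prior if none yet."""
        if self.failures == 0:
            return self.initial_mtbf_s
        return self.elapsed_s / self.failures

    def period(self):
        """Checkpoint period (seconds) for the current failure-rate estimate."""
        return young_interval(self.checkpoint_cost_s, self.mtbf())
```

With a 2-second checkpoint cost and a 10,000-second prior MTBF, `period()` starts at 200 seconds. After `record(3600, failures=4)` the estimated MTBF drops to 900 seconds and the period shrinks to 60 seconds, so checkpoints become more frequent as the observed failure rate rises.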
