Journal of Supercomputing

Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

Abstract

The complexity and scale of high-performance computing (HPC) systems are increasing rapidly, making fault tolerance a critical challenge. In this paper, we consider the impact of multiple proactive actions on proactive fault tolerance and periodic checkpointing. We extend Aupy's model to account for multiple proactive actions, including proactive checkpointing and task migration. We then propose optimal strategies for deciding when to trust predictions, and provide algorithms for computing the optimal interval between periodic checkpoints. The results show that the proposed method can significantly improve system productivity. Our case study indicates that the recall of the predictor matters more on small platforms, and that precision becomes increasingly important as the scale of the system grows.
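
The quantities named in the abstract (checkpoint interval, predictor precision and recall, platform scale) can be illustrated with a minimal sketch. It is not the paper's model or algorithm: it uses Young's classical first-order approximation for the checkpoint interval, assumes independent node failures so that platform MTBF shrinks with node count, and assumes that every correctly predicted fault is fully handled by a proactive action, so that only unpredicted faults drive periodic checkpointing. All function and parameter names are hypothetical.

```python
import math


def platform_mtbf(node_mtbf_s: float, n_nodes: int) -> float:
    """Platform MTBF, assuming independent and identical node failures."""
    return node_mtbf_s / n_nodes


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def interval_with_prediction(checkpoint_cost_s: float, mtbf_s: float, recall: float) -> float:
    """Illustrative assumption: a fraction `recall` of faults is predicted and
    handled proactively, so unpredicted faults arrive with MTBF / (1 - recall).
    Young's formula is then applied to the unpredicted faults only."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return young_interval(checkpoint_cost_s, mtbf_s / (1.0 - recall))


if __name__ == "__main__":
    # Hypothetical numbers: 10-year node MTBF, 10-minute checkpoint.
    node_mtbf = 10 * 365 * 24 * 3600.0
    for n in (1_000, 100_000):
        mu = platform_mtbf(node_mtbf, n)
        print(n, young_interval(600.0, mu), interval_with_prediction(600.0, mu, 0.8))
```

Under these assumptions, a higher recall lengthens the periodic checkpoint interval because fewer faults strike unannounced. The sketch ignores the cost of acting on false predictions, which is where precision enters and which the paper's strategies for deciding when to trust predictions address.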