Conference: The 39th International Conference on Parallel Processing

Checkpointing vs. Migration for Post-Petascale Supercomputers



Abstract

An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^{20} nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
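To give a feel for why standard periodic checkpointing scales poorly, the sketch below uses Young's classical first-order approximation for the checkpoint period, T = sqrt(2 C M), where C is the checkpoint cost and M is the platform mean time between failures. This is not the analytical model developed in the paper, which the abstract does not spell out. The function names, the 10-year node MTBF, the 10-minute checkpoint cost, and the assumption that platform MTBF equals node MTBF divided by node count are all illustrative assumptions.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent nodes with identical failure rates."""
    return node_mtbf_s / num_nodes


def young_period(checkpoint_cost_s: float, platform_mtbf_s: float) -> float:
    """Young's first-order checkpoint period: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * platform_mtbf_s)


def waste_fraction(checkpoint_cost_s: float, platform_mtbf_s: float) -> float:
    """First-order fraction of time lost, C/T + T/(2M), at Young's period.

    The approximation only holds when C is much smaller than M, so the
    result is clamped to 1.0 (no useful progress) once it breaks down.
    """
    period = young_period(checkpoint_cost_s, platform_mtbf_s)
    waste = checkpoint_cost_s / period + period / (2.0 * platform_mtbf_s)
    return min(waste, 1.0)


if __name__ == "__main__":
    node_mtbf_s = 10 * 365 * 24 * 3600.0  # hypothetical: 10-year MTBF per node
    checkpoint_cost_s = 600.0             # hypothetical: 10-minute checkpoint

    for exponent in (10, 14, 17, 20):
        nodes = 2 ** exponent
        mtbf = platform_mtbf(node_mtbf_s, nodes)
        waste = waste_fraction(checkpoint_cost_s, mtbf)
        print(f"2^{exponent} nodes: platform MTBF {mtbf:10.1f} s, "
              f"estimated waste {waste:.2f}")
```

Under these assumed numbers the platform MTBF at 2^{20} nodes drops below the checkpoint cost itself, so the first-order estimate saturates. That is in line with the abstract's remark that periodic checkpointing alone scales poorly, though the paper's own models and parameters may differ.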
