Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales

Abstract

Future extreme-scale systems are expected to experience different types of failures that affect applications at different scales, from transient uncorrectable memory errors in individual processes to massive system outages. In this paper, we propose a multilevel checkpoint model that takes into account uncertain execution scales (different numbers of processes/cores). The contribution is threefold: (1) we provide an in-depth analysis of why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously; (2) we devise a novel method that can quickly obtain an optimized solution, the first successful attempt in multilevel checkpoint models with uncertain scales; and (3) we perform both large-scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. The experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3% to 88% in wall-clock length.
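To make the coupling between checkpoint interval and core count concrete, the following is a minimal single-level sketch, not the multilevel method the paper proposes. It assumes Young's first-order interval approximation, an Amdahl-style speedup, and a system failure rate proportional to the number of cores. All function names, parameters, and numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def expected_wallclock(n_cores, work, serial_frac, ckpt_cost,
                       restart_cost, core_fail_rate):
    """First-order estimate of wall-clock time on n_cores (illustrative model).

    Assumptions (not taken from the paper):
      - system MTBF shrinks as 1 / (core_fail_rate * n_cores)
      - productive time follows Amdahl's law
      - the estimate is only meaningful while the interval is much
        smaller than the MTBF
    """
    mtbf = 1.0 / (core_fail_rate * n_cores)
    tau = young_interval(ckpt_cost, mtbf)
    productive = work * (serial_frac + (1.0 - serial_frac) / n_cores)
    # Checkpoint overhead per unit of work, plus expected rework and
    # restart cost per failure.
    overhead = ckpt_cost / tau + (tau / 2.0 + restart_cost) / mtbf
    return productive * (1.0 + overhead)


def best_core_count(candidates, **params):
    """Pick the candidate core count with the smallest estimated wall-clock."""
    return min(candidates, key=lambda n: expected_wallclock(n, **params))


if __name__ == "__main__":
    # Hypothetical parameters, in seconds and failures per core-second.
    params = dict(work=1e6, serial_frac=0.001, ckpt_cost=60.0,
                  restart_cost=120.0, core_fail_rate=1e-9)
    candidates = [2 ** k for k in range(21)]
    n = best_core_count(candidates, **params)
    print(n, expected_wallclock(n, **params))
```

Even in this simplified setting the best interval depends on the core count through the MTBF, and the best core count depends on the interval through the overhead term, so the two cannot be tuned independently. With several checkpoint levels, each with its own cost and failure rate, the same coupling applies to every level at once.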