Published in: IEEE International Conference on Cluster Computing

To checkpoint or not to checkpoint: Understanding energy-performance-I/O tradeoffs in HPC checkpointing


Abstract

As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies have traditionally been optimized and analyzed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, are not well understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy, and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.
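The abstract does not reproduce the paper's formulas, so the following is only a minimal sketch of the kind of tradeoff it describes, not the authors' method. It uses Young's classic first-order approximation for the runtime-optimal checkpoint interval, plus a toy energy model that is purely an assumption made here: checkpointing draws power p_ckpt_w, recomputing lost work draws power p_comp_w, and failures are exponentially distributed. All function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the runtime-optimal interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def runtime_overhead(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order wasted-time fraction: checkpoint cost plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def energy_overhead(interval_s, checkpoint_cost_s, mtbf_s, p_ckpt_w, p_comp_w):
    """Toy model (assumption): average wasted power in watts.

    Checkpoint time is charged at p_ckpt_w, expected rework at p_comp_w.
    """
    return (p_ckpt_w * checkpoint_cost_s / interval_s
            + p_comp_w * interval_s / (2.0 * mtbf_s))


def energy_interval(checkpoint_cost_s, mtbf_s, p_ckpt_w, p_comp_w):
    """Interval that minimizes energy_overhead under the toy model above."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s * p_ckpt_w / p_comp_w)


if __name__ == "__main__":
    # Illustrative numbers only: 5-minute checkpoints, 24-hour MTBF,
    # 100 W per node while checkpointing, 400 W while computing.
    C, M, P_CKPT, P_COMP = 300.0, 86400.0, 100.0, 400.0
    tau_time = young_interval(C, M)
    tau_energy = energy_interval(C, M, P_CKPT, P_COMP)
    for label, tau in (("runtime-optimized", tau_time),
                       ("energy-optimized", tau_energy)):
        print(label, tau,
              runtime_overhead(tau, C, M),
              energy_overhead(tau, C, M, P_CKPT, P_COMP))
```

Under these assumed numbers the energy-optimized interval is shorter than the runtime-optimized one, because checkpointing is cheaper in power than recomputation. It therefore trades more checkpoint time (and more I/O) for less wasted energy, which is the style of tradeoff the abstract refers to.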
