Conference: International Green and Sustainable Computing Conference

Monitoring strategies for scalable dynamic checkpointing

Abstract

Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.
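The event flow described in the abstract (listen to hardware events, rate their importance, forward the important ones so the runtime can react to a regime change) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the event types, the importance weights, the threshold, the fixed-length degraded window, and the use of Young's first-order formula, sqrt(2 * checkpoint cost * MTBF), to pick a checkpoint interval per regime.

```python
import math
from dataclasses import dataclass


@dataclass
class HardwareEvent:
    node: str
    kind: str         # hypothetical event type label, e.g. "ecc_error"
    timestamp: float  # seconds since the start of the run


# Hypothetical importance weights per event type (illustrative values only).
IMPORTANCE = {
    "ecc_error": 0.9,
    "temperature_warning": 0.6,
    "fan_speed_change": 0.1,
}


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


class ResilienceRuntime:
    """Stand-in for a fault tolerance runtime with one interval per regime."""

    def __init__(self, checkpoint_cost, normal_mtbf, degraded_mtbf, degraded_window):
        self.intervals = {
            "normal": young_interval(checkpoint_cost, normal_mtbf),
            "degraded": young_interval(checkpoint_cost, degraded_mtbf),
        }
        self.degraded_window = degraded_window
        self.degraded_until = float("-inf")

    def notify(self, event: HardwareEvent) -> None:
        # An important event opens (or extends) a degraded-regime window.
        self.degraded_until = max(
            self.degraded_until, event.timestamp + self.degraded_window
        )

    def interval_at(self, now: float) -> float:
        regime = "degraded" if now < self.degraded_until else "normal"
        return self.intervals[regime]


class Monitor:
    """Filters events by importance and forwards the important ones."""

    def __init__(self, runtime: ResilienceRuntime, threshold: float = 0.5):
        self.runtime = runtime
        self.threshold = threshold

    def observe(self, event: HardwareEvent) -> bool:
        # Unknown event types are treated as unimportant in this sketch.
        importance = IMPORTANCE.get(event.kind, 0.0)
        if importance >= self.threshold:
            self.runtime.notify(event)
            return True
        return False


if __name__ == "__main__":
    runtime = ResilienceRuntime(
        checkpoint_cost=60.0, normal_mtbf=86400.0,
        degraded_mtbf=3600.0, degraded_window=1800.0,
    )
    monitor = Monitor(runtime)
    monitor.observe(HardwareEvent("node-1", "ecc_error", timestamp=10.0))
    print(runtime.interval_at(100.0) < runtime.intervals["normal"])
```

In this sketch the runtime checkpoints more often while a degraded window is open and returns to the longer interval once the window expires. A production system would need a calibrated importance model and regime detector rather than fixed weights.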