Journal: IEEE Transactions on Parallel and Distributed Systems

Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales

Abstract

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed locally, further scalability is enabled, not only because recovering locally is intrinsically cheaper, but also because of derived effects that arise with some application types. In this paper we model one such effect, namely multiple failure masking, which manifests when running stencil parallel computations in an environment where failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled, which enables several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, one whose manifestation becomes more evident and beneficial as the machine size or the failure rate increases.
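
The masking effect described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's model: it assumes a 1D stencil in which each process waits for itself and its two nearest neighbors before starting an iteration, a fixed per-iteration work time, and a fixed local recovery cost. The function name `simulate` and all of its parameters are hypothetical, chosen only for this illustration.

```python
def simulate(n_procs, n_iters, failures, work=1, recovery=10):
    """Return the completion time of a toy 1D stencil run.

    failures is a set of (iteration, process) pairs. Each listed failure
    is assumed to be recovered locally and adds `recovery` time units to
    that process in that iteration only.
    """
    finish = [0] * n_procs  # time at which each process finished its last iteration
    for t in range(n_iters):
        nxt = []
        for i in range(n_procs):
            # A process can start once it and its nearest neighbors
            # have finished the previous iteration (non-periodic boundary).
            lo, hi = max(i - 1, 0), min(i + 1, n_procs - 1)
            start = max(finish[lo:hi + 1])
            cost = work + (recovery if (t, i) in failures else 0)
            nxt.append(start + cost)
        finish = nxt
    return max(finish)


baseline = simulate(100, 20, set())
single = simulate(100, 20, {(0, 0)})
# Second failure hits a process the first delay has not reached yet.
masked = simulate(100, 20, {(0, 0), (5, 50)})
# Second failure hits a process that is already delayed by the first.
unmasked = simulate(100, 20, {(0, 0), (5, 2)})
print(baseline, single, masked, unmasked)
```

Under these assumptions the script prints `20 30 30 40`. The delay from a failure spreads outward by one neighbor per iteration, so a second failure on a process that the first delay has not yet reached costs no additional time (30 in both cases), whereas a second failure inside the already-delayed region adds its full recovery cost (40).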