Conference: International Conference for High Performance Computing, Networking, Storage and Analysis
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales

Abstract

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.
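As a rough illustration of what diskless checkpointing with online recovery can look like, the sketch below simulates a handful of ranks inside a single Python process. It is not Fenix's API and does not use MPI. The names `Rank`, `checkpoint`, `advance`, and `recover` are hypothetical. The ring layout, in which each rank keeps a copy of its left neighbour's state in memory, is an assumption made for the example, since the abstract does not say where the in-memory copies are placed.

```python
import copy


class Rank:
    """One simulated process: live state plus two in-memory checkpoint slots."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0, "value": float(rank_id)}
        self.local_ckpt = None  # copy of this rank's own state
        self.buddy_ckpt = None  # copy of the left neighbour's state


def checkpoint(ranks):
    """Store every rank's state in its own memory and in its right neighbour's.

    Assumes at least two ranks; with one rank the buddy copy is lost with it.
    """
    n = len(ranks)
    for r in ranks:
        r.local_ckpt = copy.deepcopy(r.state)
        ranks[(r.rank_id + 1) % n].buddy_ckpt = copy.deepcopy(r.state)


def advance(ranks):
    """Stand-in for one step of application work."""
    for r in ranks:
        r.state["step"] += 1
        r.state["value"] += 1.0


def recover(ranks, failed_id):
    """Replace a failed rank and roll every rank back to the last checkpoint."""
    n = len(ranks)
    holder = ranks[(failed_id + 1) % n]  # right neighbour holds the lost state
    left = ranks[(failed_id - 1) % n]    # left neighbour's copy was lost too

    replacement = Rank(failed_id)        # stands in for a re-spawned process
    replacement.state = copy.deepcopy(holder.buddy_ckpt)
    replacement.local_ckpt = copy.deepcopy(holder.buddy_ckpt)
    replacement.buddy_ckpt = copy.deepcopy(left.local_ckpt)
    ranks[failed_id] = replacement

    # Survivors roll back from their own memory, so nothing is read from disk.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.local_ckpt)


if __name__ == "__main__":
    ranks = [Rank(i) for i in range(4)]
    advance(ranks)
    advance(ranks)
    checkpoint(ranks)   # checkpoint taken at step 2
    advance(ranks)      # work that will be lost
    recover(ranks, 2)   # rank 2 fails and is replaced without restarting the job
    print([r.state for r in ranks])
```

In a real MPI setting the replacement would be a newly spawned process and the communicator would have to be repaired before the rollback. The sketch only shows the data movement: lost state is rebuilt from a surviving neighbour's memory while the job keeps running.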
