Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales

Abstract

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Process/node failures, an important class of failures, are typically handled today by terminating the job and restarting it from the last stored checkpoint. This approach is not expected to scale to exascale. In this paper we present Fenix, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online (i.e., without disrupting the job) and transparent manner. Fenix provides mechanisms for transparently capturing failures, re-spawning new processes, fixing failed communicators, restoring application state, and returning execution control back to the application. To enable automatic data recovery, Fenix relies on application-driven, diskless, implicitly coordinated checkpointing. Using the S3D combustion simulation running on the Titan Cray-XK7 production system at ORNL, we experimentally demonstrate Fenix's ability to tolerate high failure rates (e.g., more than one per minute) with low overhead while sustaining performance.
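
The diskless checkpointing idea named in the abstract can be illustrated with a small, single-process simulation. This is only a sketch under stated assumptions: it is not the Fenix API and uses no MPI. The names `Rank`, `checkpoint`, `compute_step`, and `recover` are hypothetical, the ring-neighbor ("buddy") placement of checkpoint copies is an assumption, and rolling every surviving rank back to its own checkpoint is a simplification chosen for illustration.

```python
import copy


class Rank:
    """One simulated process: live state plus in-memory checkpoint storage."""

    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.state = {"step": 0, "data": [float(rank_id)]}
        self.local_ckpt = None  # this rank's own last checkpoint (in memory)
        self.buddy_ckpt = None  # copy held on behalf of the left neighbor


def checkpoint(ranks):
    """Application-driven checkpoint: each rank keeps a copy in its own memory
    and sends one to its right neighbor. Nothing is written to disk."""
    n = len(ranks)
    for r in ranks:
        snap = copy.deepcopy(r.state)
        r.local_ckpt = snap
        ranks[(r.rank_id + 1) % n].buddy_ckpt = copy.deepcopy(snap)


def compute_step(ranks):
    """Stand-in for one iteration of the simulation."""
    for r in ranks:
        r.state["step"] += 1
        r.state["data"].append(r.state["data"][-1] + 1.0)


def recover(ranks, failed_id):
    """Replace a failed rank and restore a consistent state.
    Assumes at least two ranks and a checkpoint taken before the failure."""
    n = len(ranks)
    holder = ranks[(failed_id + 1) % n]  # holds the failed rank's checkpoint
    left = ranks[(failed_id - 1) % n]    # its checkpoint copy was lost too
    replacement = Rank(failed_id)        # stands in for a re-spawned process
    replacement.state = copy.deepcopy(holder.buddy_ckpt)
    replacement.local_ckpt = copy.deepcopy(holder.buddy_ckpt)
    replacement.buddy_ckpt = copy.deepcopy(left.local_ckpt)
    ranks[failed_id] = replacement
    # Survivors roll back to their own in-memory checkpoint.
    for r in ranks:
        if r.rank_id != failed_id:
            r.state = copy.deepcopy(r.local_ckpt)
```

In this sketch the job never restarts: after `recover`, all ranks resume from the step at which `checkpoint` was last called, and the work done since then is recomputed.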

Bibliographic record

Source
International Conference for High Performance Computing, Networking, Storage and Analysis | 2014 | pp. 895-906 | 12 pages
Authors
Gamell Marc; Katz Daniel S.; Kolla Hemanth; Chen Jiann-Jong; Klasky Scott; Parashar Manish;
Original format: PDF
Keywords
application program interfaces; checkpointing; parallel processing; Fenix; MPI-based parallel application; S3D combustion simulation; application resilience; automatic data recovery; check pointing; exascale vision; extreme scales; node failures; online failure recovery; process-node-blade-cabinet failure; scientific application; Checkpointing; Combustion; Fault tolerance; Fault tolerant systems; Peer-to-peer computing; Runtime; Synchronization;

Similar literature

1. Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations [J]. Dongfang Zhao, Ning Liu, Dries Kimpe, IEEE Transactions on Parallel and Distributed Systems. 2016, no. 6
2. Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales [J]. Marc Gamell, Keita Teranishi, Jackson Mayo, IEEE Transactions on Parallel and Distributed Systems. 2017, no. 10
3. Three-phase microfluidic reactor networks - Design, modeling and application to scaled-out nanoparticle-catalyzed hydrogenations with online catalyst recovery and recycle [J]. Yap Swee Kun, Wong Wai Kuan, Ng Nicholas Xiang Yang, Chemical Engineering Science. 2017
4. Local recovery and failure masking for stencil-based applications at extreme scales [C]. Marc Gamell, Keita Teranishi, Michael A. Heroux, International Conference for High Performance Computing, Networking, Storage and Analysis. 2015
5. Application-Aware On-Line Failure Recovery For Extreme-Scale HPC Environments [D]. Balmana, Marc Gamell. 2017
6. Automatic recognition of conceptualization zones in scientific articles and two life science applications [O]. Maria Liakata, Shyamasree Saha, Simon Dobnik
7. FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems [O]. 2015
