Journal: Future Generation Computer Systems

Local rollback for resilient MPI applications with application-level checkpointing and message logging

Abstract

The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances the failure has a more localized scope, and its impact is usually restricted to a subset of the resources being used. A global rollback therefore results in unnecessary overhead and energy consumption, since all processes, including those unaffected by the failure, discard their state and roll back to the last checkpoint to repeat computations that were already done. The User Level Failure Mitigation (ULFM) interface, the latest proposal for the inclusion of resilience features in the Message Passing Interface (MPI) standard, enables the deployment of more flexible recovery strategies, including localized recovery. This work proposes a local rollback approach that can be generally applied to Single Program, Multiple Data (SPMD) applications by combining ULFM, the Compiler for Portable Checkpointing (CPPC) tool, and the Open MPI VProtocol system-level message logging component. Only failed processes are recovered from the last checkpoint, while consistency before further progress in the execution is achieved through a two-level message logging process. To further optimize this approach, point-to-point communications are logged by the Open MPI VProtocol component, while collective communications are optimally logged at the application level, thereby decoupling the logging protocol from the particular collective implementation. This spatially coordinated protocol applied by CPPC reduces the log size, the log memory requirements, and the overall impact of resilience on the applications. © 2018 Elsevier B.V. All rights reserved.
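The local rollback idea described in the abstract can be illustrated with a toy simulation. The sketch below is not the CPPC, ULFM, or VProtocol implementation and uses no MPI calls: it is a single-threaded Python model in which the `Proc` class, the ring-exchange workload, and the in-memory sender-side log are all assumptions made for illustration. It shows only the point-to-point case: the failed rank restores its checkpoint and re-executes using messages replayed from its neighbour's log, while surviving ranks keep their state. Collective logging is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """Hypothetical stand-in for one MPI rank (illustration only)."""
    rank: int
    value: int
    step: int = 0
    checkpoint: tuple = None                       # (value, step) at last checkpoint
    send_log: dict = field(default_factory=dict)   # (dest, step) -> payload


def run_step(procs, step):
    """One ring-exchange iteration: each rank sends its value to rank + 1."""
    n = len(procs)
    # Send phase: every payload is recorded in the sender's log.
    for p in procs:
        dest = (p.rank + 1) % n
        p.send_log[(dest, step)] = p.value
    # Receive phase: each rank adds the value sent by its left neighbour.
    for p in procs:
        src = (p.rank - 1) % n
        p.value += procs[src].send_log[(p.rank, step)]
        p.step = step + 1


def local_rollback(procs, failed, fail_step):
    """Recover only the failed rank; survivors are left untouched."""
    n = len(procs)
    p = procs[failed]
    src, dest = (failed - 1) % n, (failed + 1) % n
    # The failure loses volatile state and the in-memory log.
    p.value, p.step = p.checkpoint
    p.send_log.clear()
    replayed = 0
    for step in range(p.step, fail_step):
        # Rebuild the log entry, but do not deliver it again:
        # the receiver already consumed this message before the failure.
        p.send_log[(dest, step)] = p.value
        # Re-receive the message from the surviving sender's log.
        p.value += procs[src].send_log[(failed, step)]
        p.step = step + 1
        replayed += 1
    return replayed


def simulate(n, steps, ckpt_at=None, fail_rank=None, fail_at=None):
    """Run the ring workload, optionally with one checkpoint and one failure.

    Assumes ckpt_at <= fail_at when a failure is injected.
    """
    procs = [Proc(rank=r, value=r + 1) for r in range(n)]
    for step in range(steps):
        if step == ckpt_at:
            for p in procs:
                p.checkpoint = (p.value, p.step)
        if step == fail_at:
            local_rollback(procs, fail_rank, fail_at)
        run_step(procs, step)
    return [p.value for p in procs]


if __name__ == "__main__":
    print(simulate(4, 6))
    print(simulate(4, 6, ckpt_at=2, fail_rank=1, fail_at=5))
```

In this model both runs end in the same state, because the recovered rank sees exactly the messages it received before the failure. Only that rank repeats work (the steps between the checkpoint and the failure), which is the saving over a global rollback that the abstract refers to.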