Journal of Supercomputing

Resilient MPI applications using an application-level checkpointing framework and ULFM


Abstract

Future exascale systems, built from millions of cores, will exhibit high failure rates, and long-running applications will need new fault tolerance techniques to run to completion. The Fault Tolerance Working Group within the MPI Forum has presented the User Level Failure Mitigation (ULFM) proposal, which provides new functionality for implementing resilient MPI applications. In this work, the CPPC checkpointing framework is extended to exploit the new ULFM functionality. The proposed solution transparently obtains resilient MPI applications by instrumenting the original application code. In addition, multithreaded multilevel checkpointing, in which checkpoint files are saved at different memory levels, improves the scalability of the solution. The experimental evaluation shows low overhead when tolerating failures in one or several MPI processes.