WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

Abstract

As supercomputers grow in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability of supercomputers becomes an increasingly serious problem. MPI is currently the de facto standard for building high-performance applications, and fault-tolerance methods for MPI remain an active research topic. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (or even the operating system), or implement a complicated protocol that logs large numbers of messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of their applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any messages. It implements coordination using only Barrier operations rather than a complicated protocol. Furthermore, to reduce the cost of fault tolerance, we narrow the synchronization range of the barrier and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationships between processes. We also propose a fault-tolerance framework built on top of WBC-ALC and discuss an implementation of it. Experimental results on the NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that, compared with BC-ALC, WBC-ALC reduces the average coordination time and the average backup time of a single checkpoint by 44.5% and 5.7%, respectively.
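The difference between global and group synchronization can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say how groups are formed, so the code assumes a group is a connected component of the communication graph. Threads stand in for MPI processes, and the names `communication_groups` and `run_checkpoint` are hypothetical. In a real MPI program the per-group barrier would presumably be an `MPI_Barrier` on a sub-communicator rather than a `threading.Barrier`.

```python
import threading
from collections import defaultdict


def communication_groups(num_procs, pairs):
    """Partition ranks 0..num_procs-1 into groups of ranks that exchange
    messages directly or transitively (assumed grouping rule)."""
    parent = list(range(num_procs))

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for r in range(num_procs):
        groups[find(r)].append(r)
    return sorted(groups.values())


def run_checkpoint(groups):
    """Simulate one checkpoint in which each rank waits only on the
    barrier of its own group instead of a global barrier."""
    saved = []
    lock = threading.Lock()

    def worker(rank, barrier):
        barrier.wait()          # group members reach the checkpoint together
        with lock:
            saved.append(rank)  # stand-in for saving application state
        barrier.wait()          # resume once the whole group has saved

    threads = []
    for members in groups:
        barrier = threading.Barrier(len(members))
        for rank in members:
            threads.append(threading.Thread(target=worker, args=(rank, barrier)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(saved)


if __name__ == "__main__":
    # Hypothetical communication pattern: ranks 0-2 talk, ranks 3-4 talk, rank 5 is isolated.
    groups = communication_groups(6, [(0, 1), (1, 2), (3, 4)])
    print(groups)
    print(run_checkpoint(groups))
```

With a single group containing every rank, the same code behaves like the global barrier that the abstract attributes to BC-ALC. Smaller groups mean each barrier has fewer participants, which is the effect the abstract credits for the lower coordination time.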

Bibliographic details

  • Source
    IEICE Transactions on Information and Systems | 2012, No. 3 | pp. 786-796 | 11 pages
  • Author affiliation

    The authors are with National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, China;


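  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA);
  • Original format: PDF
  • Full-text language: English
  • Chinese Library Classification
  • Keywords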

    application-level checkpointing; weak coordinated; MPI; fault tolerance; consistency;

  • Date indexed: 2022-08-18 00:26:20
