Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing

Abstract

The high failure rate expected of future supercomputers requires new fault-tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: we identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation between events of such applications, called the always-happens-before relation. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol, called SPBC, hierarchically combines coordinated checkpointing and message logging. It is the first protocol that provides failure containment without reliably logging any information apart from process checkpoints, and it does so without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate good performance of our protocol during both failure-free execution and recovery.
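
The abstract names channel-determinism without defining it. As a minimal sketch, assuming the property means that the sequence of messages sent on each (sender, receiver) channel is the same in every execution, whatever the global interleaving, two execution traces could be compared as follows. The trace format and the function names are hypothetical, chosen only for illustration; this is not the SPBC protocol itself.

```python
from collections import defaultdict


def per_channel_sends(trace):
    """Group payloads by (sender, receiver) channel, preserving send order.

    `trace` is a hypothetical list of (sender, receiver, payload) tuples
    in the global order in which the sends were observed.
    """
    channels = defaultdict(list)
    for sender, receiver, payload in trace:
        channels[(sender, receiver)].append(payload)
    return dict(channels)


def same_channel_behavior(trace_a, trace_b):
    """True if both traces send the same message sequence on every channel.

    Assumption: this is what channel-determinism requires of any two
    executions; the global interleaving across channels may differ.
    """
    return per_channel_sends(trace_a) == per_channel_sends(trace_b)


# Same per-channel sequences, different global interleaving.
run_1 = [(0, 1, "a"), (2, 1, "x"), (0, 1, "b")]
run_2 = [(2, 1, "x"), (0, 1, "a"), (0, 1, "b")]
# Channel (0, 1) sends "b" before "a": the per-channel order differs.
run_3 = [(0, 1, "b"), (2, 1, "x"), (0, 1, "a")]

print(same_channel_behavior(run_1, run_2))
print(same_channel_behavior(run_1, run_3))
```

Under this assumption, run_1 and run_2 count as equivalent even though process 1 may receive the messages from processes 0 and 2 in a different relative order, while run_3 does not.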
