
A Lightweight Message Logging Scheme for Fault Tolerant MPI



Abstract

This paper presents a new lightweight logging scheme that provides fault tolerance for MPI. Checkpoint-based recovery is the most widely used fault tolerance scheme for distributed systems. However, to preserve consistency, all processes must be rolled back and restarted even when only a single process fails. Message logging can be used so that the other processes proceed unaffected by the failure, but logging every message tends to be prohibitively expensive. We note that applications programmed using MPI follow certain rules, so not all messages need to be logged. Our logging scheme is based on this observation: only the absolutely necessary information is logged or piggybacked. As a result, our scheme can greatly reduce the logging overhead, and the experimental results match this expectation well.
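The abstract does not spell out the scheme itself, so the following is only a generic illustration of why message logging lets surviving processes avoid rollback. It is a minimal sketch of plain sender-based logging in Python with no MPI involved, not the paper's method. The `Process` class, its fields, and the `demo` function are all hypothetical names introduced here, and the sketch assumes a single sender whose log preserves delivery order.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process (hypothetical): its state is the sum of delivered payloads."""
    rank: int
    state: int = 0
    delivered: int = 0          # number of messages delivered so far
    saved_state: int = 0        # state captured at the last checkpoint
    saved_delivered: int = 0    # delivery count captured at the last checkpoint
    send_log: list = field(default_factory=list)  # copies of every payload sent

    def send(self, dest, payload):
        # Sender-based logging: keep a copy before handing the message over.
        self.send_log.append(payload)
        dest.deliver(payload)

    def deliver(self, payload):
        self.state += payload
        self.delivered += 1

    def take_checkpoint(self):
        self.saved_state = self.state
        self.saved_delivered = self.delivered

    def recover_from(self, sender):
        # Roll back only this process, then replay what it had received
        # after its checkpoint from the sender's log. The sender is untouched.
        self.state = self.saved_state
        self.delivered = self.saved_delivered
        for payload in sender.send_log[self.saved_delivered:]:
            self.deliver(payload)


def demo():
    sender, receiver = Process(rank=0), Process(rank=1)
    sender.send(receiver, 5)
    receiver.take_checkpoint()
    sender.send(receiver, 7)
    sender.send(receiver, 11)
    before = receiver.state

    # Simulate a crash: the receiver loses its volatile state.
    receiver.state = -1
    receiver.delivered = 0

    receiver.recover_from(sender)
    return before, receiver.state, sender.state


if __name__ == "__main__":
    print(demo())
```

In this toy the sender logs every payload, which is exactly the overhead the abstract calls prohibitively expensive; the paper's contribution is to log or piggyback only the information that MPI's rules make necessary.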
