Journal: IEEE Transactions on Parallel and Distributed Systems

On coordinated checkpointing in distributed systems

Abstract

Coordinated checkpointing simplifies failure recovery and eliminates the domino effect by preserving a consistent global checkpoint on stable storage. However, the approach suffers from the high overhead of the checkpointing process. Two approaches are used to reduce this overhead: the first minimizes the number of synchronization messages and the number of checkpoints; the second makes the checkpointing process nonblocking. These two approaches remained orthogonal until the Prakash-Singhal algorithm combined them: it forces only a minimum number of processes to take checkpoints and does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: no nonblocking algorithm exists that forces only a minimum number of processes to take checkpoints. Based on this result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. We also point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
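
To make the notion of a consistent global checkpoint concrete, here is a minimal Python sketch. It assumes the standard definition: a set of local checkpoints is consistent when it contains no orphan message, that is, no message recorded as received whose send is not recorded. The names `Message`, `orphan_messages`, and `is_consistent`, and the use of per-process event counters, are illustrative assumptions. This is neither the Prakash-Singhal algorithm nor the algorithm proposed in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """A message between two processes (hypothetical model, not from the paper)."""
    sender: str
    send_time: int   # sender's local event counter at the send event
    receiver: str
    recv_time: int   # receiver's local event counter at the receive event


def orphan_messages(checkpoints, messages):
    """Return the messages that are recorded as received but not as sent.

    `checkpoints` maps each process name to the local event counter at which
    its checkpoint was taken; events with a counter <= that value are
    considered part of the checkpoint.
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]   # receive is in the checkpoint
        and m.send_time > checkpoints[m.sender]     # send is not
    ]


def is_consistent(checkpoints, messages):
    """A global checkpoint is consistent when it has no orphan messages."""
    return not orphan_messages(checkpoints, messages)


messages = [
    Message("P1", 2, "P2", 3),
    Message("P2", 5, "P3", 1),
]

# P1 checkpoints after sending its message, so P2's receive is covered.
consistent_cut = {"P1": 2, "P2": 4, "P3": 0}

# P1 checkpoints before sending, but P2's checkpoint already records the receive.
inconsistent_cut = {"P1": 1, "P2": 4, "P3": 0}

print(is_consistent(consistent_cut, messages))
print(is_consistent(inconsistent_cut, messages))
```

In this sketch the second cut is rejected because rolling back to it would leave P2 having received a message that P1 never sent. This is the kind of inconsistency that coordinated checkpointing is designed to avoid.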