On coordinated checkpointing in distributed systems

Guohong Cao; Singhal M.

Abstract

Coordinated checkpointing simplifies failure recovery and eliminates the domino effect after a failure by preserving a consistent global checkpoint on stable storage. However, the approach suffers from the high overhead of the checkpointing process. Two approaches are used to reduce this overhead: the first minimizes the number of synchronization messages and the number of checkpoints; the second makes the checkpointing process nonblocking. These two approaches were orthogonal until the Prakash-Singhal algorithm combined them: that algorithm forces only a minimum number of processes to take checkpoints, and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. We also point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.
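
To make the notion of a "consistent global checkpoint" concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes the standard definition: a set of local checkpoints is consistent if it contains no orphan message, that is, no message whose receipt is recorded while its send is not. The names `Message`, `orphan_messages`, and `is_consistent`, and the use of per-process event indices, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_seq: int   # sender's local event index at which the message was sent
    receiver: str
    recv_seq: int   # receiver's local event index at which it was received


def orphan_messages(checkpoint, messages):
    """Return messages whose receive is inside the checkpoint but whose send is not.

    `checkpoint` maps each process name to the index of the last local event
    covered by that process's checkpoint (hypothetical representation).
    """
    return [
        m for m in messages
        if m.recv_seq <= checkpoint[m.receiver]
        and m.send_seq > checkpoint[m.sender]
    ]


def is_consistent(checkpoint, messages):
    """A global checkpoint is consistent when it records no orphan message."""
    return not orphan_messages(checkpoint, messages)


# P1 sends at its event 3; P2 receives at its event 2.
msgs = [Message("P1", 3, "P2", 2)]

# P2's checkpoint includes the receive, P1's does not include the send.
inconsistent = is_consistent({"P1": 2, "P2": 2}, msgs)

# Both the send and the receive are recorded.
consistent = is_consistent({"P1": 3, "P2": 2}, msgs)
```

In the first case, rolling back to the checkpoint would leave P2 having received a message that P1 never sent. Uncoordinated schemes can hit this situation repeatedly during recovery, which is the domino effect the abstract refers to.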

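Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems | 1998, No. 12 | pp. 1213-1225 | 13 pages
Authors
Guohong Cao; Singhal M.
Original format: PDF
Text language: English
Chinese Library Classification: Computing technology, computer technology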

Similar literature

1. Minimum mutable checkpoint-based coordinated checkpointing protocol for mobile distributed systems [J]. Lalit K. Awasthi, Manoj Misra, R.C. Joshi. International Journal of Communication Networks and Distributed Systems. 2014, No. 4
2. FNB: Fast Non-Blocking Coordinated Checkpointing Protocol for Distributed Systems [J]. Abdelhafidi Zohra, Djoudi Mohamed, Lagraa Nasreddine. Theory of Computing Systems. 2015, No. 2
3. An Efficient Coordinated Checkpointing Approach for Distributed Computing Systems with Reliable Channels [J]. Lalit K. Awasthi, Manoj Misra, Ramesh C. Joshi. International Journal of Computers & Applications. 2012, No. 1
4. Low Overhead Time Coordinated Checkpointing Algorithm for Mobile Distributed Systems [C]. Jangra Surender, Sejwal Arvind, Kumar Anil. International Conference on Networks Communications. 2013
5. Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems [D]. Hursey, Joshua. 2010
6. Dynamic Clustering and Coordinated User Scheduling for Cooperative Interference Cancellation on Ultra-High Density Distributed Antenna Systems [O]. Kazuki Maruta. 2018
7. Dealing with Frequent Aborts in Minimum-process Coordinated Checkpointing Algorithm for Mobile Distributed Systems [O]. Parveen Kumar, Preeti Gupta, Anil Kumar Solanki. 2011
