Conference: International Conference on Algorithms and Architectures for Parallel Processing

Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System

Abstract

Cluster systems play a central role in realizing high-performance computing at relatively low cost, and fault-tolerance features are necessary for their practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead for a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, it is generally not easy to obtain a closed form for the expected total recovery overhead. Based on a simple failure model, we obtain it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We then compare the expected total recovery overhead of the different checkpoint and rollback recovery schemes and quantitatively evaluate their effectiveness.
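The enumeration idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's model: it assumes nodes fail independently with the same probability within one checkpoint interval, ignores fault latency and skewed checkpointing, and uses made-up cost functions and constants (`mirroring_cost`, `central_server_cost`, `local`, `restart`, `per_node` are all hypothetical names and values chosen for illustration only).

```python
from itertools import product


def expected_recovery_overhead(n_nodes, p_fail, cost_of):
    """Sum cost * probability over every failure pattern of n_nodes nodes.

    Assumption (not from the paper): each node fails independently with
    probability p_fail, so a pattern with k failed nodes has probability
    p_fail**k * (1 - p_fail)**(n_nodes - k).
    """
    total = 0.0
    for pattern in product((False, True), repeat=n_nodes):
        k = sum(pattern)  # number of failed nodes in this pattern
        prob = p_fail ** k * (1.0 - p_fail) ** (n_nodes - k)
        total += prob * cost_of(pattern)
    return total


def mirroring_cost(pattern, local=1.0, restart=20.0):
    """Hypothetical cost for checkpoint mirroring.

    Nodes are paired (0,1), (2,3), ... and each holds its partner's
    checkpoint. If both nodes of any pair fail, the checkpoint is lost
    and the job pays an (assumed) restart cost; otherwise any failure
    is repaired from the surviving mirror at the local cost.
    """
    if not any(pattern):
        return 0.0
    for i in range(0, len(pattern) - 1, 2):
        if pattern[i] and pattern[i + 1]:
            return restart
    return local


def central_server_cost(pattern, per_node=3.0):
    """Hypothetical cost for central file server checkpointing.

    Every failed node reloads its checkpoint from one shared server,
    so the cost is assumed to grow linearly with the number of failures.
    """
    return per_node * sum(pattern)


if __name__ == "__main__":
    for name, cost in (("mirroring", mirroring_cost),
                       ("central server", central_server_cost)):
        print(name, expected_recovery_overhead(8, 0.01, cost))
```

Enumerating all patterns costs 2^n evaluations, so a sketch like this is only practical for a small number of nodes; a closed-form expression of the kind the abstract refers to avoids that blow-up.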
