Journal: Concurrency and Computation

A survey and review of the current state of rollback-recovery for cluster systems

Abstract

A variety of research problems require considerable time and computational resources to solve. Attempting to solve these problems produces long-running applications that need a reliable and trustworthy system on which to execute. Cluster systems provide an excellent environment for running these applications because of their low cost-to-performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated, and rollback-recovery was found to be a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. The survey showed that a great deal of work has been done on enhancing checkpointing, whereas the intricacies of providing recovery have been neglected. The problems associated with providing recovery include providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails.