Published in: Computational Science - ICCS 2007 pt.1; Lecture Notes in Computer Science; 4487

Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Abstract

As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques reduce the amount of work lost due to changing system availability but can introduce significant runtime overhead. This overhead depends largely on the length of the checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several novel heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. A simulation-based comparison of the proposed combined algorithm with traditional strategies based on checkpointing only or replication only suggests a significant reduction of average task makespan for systems with varying load.
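
The abstract does not spell out the switching rule or the tuning heuristics, so the following Python sketch is only an illustration of the general idea, not the paper's algorithm. Every name and threshold in it (`ResourceHistory`, `choose_strategy`, `tune_interval`, the factors 1.5 and 2, the interval bounds) is a hypothetical assumption: it replicates a task when enough idle resources are available, falls back to checkpointing otherwise, and lengthens or shortens the checkpointing interval based on the mean failure-free period observed on a resource.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceHistory:
    """Hypothetical record of past failure-free periods (seconds) for one resource."""
    uptimes: List[float]

    def mean_uptime(self) -> float:
        # With no recorded failures, treat the resource as fully reliable.
        if not self.uptimes:
            return float("inf")
        return sum(self.uptimes) / len(self.uptimes)


def choose_strategy(idle_resources: int, replicas_wanted: int) -> str:
    """Assumed rule: replicate when spare resources exist, otherwise checkpoint."""
    if idle_resources >= replicas_wanted:
        return "replication"
    return "checkpointing"


def tune_interval(current: float, history: ResourceHistory,
                  min_interval: float = 60.0,
                  max_interval: float = 3600.0) -> float:
    """Assumed heuristic for adapting the checkpointing interval (seconds).

    Lengthen the interval when the resource has been stable for much longer
    than the interval, shorten it when failures arrive faster than checkpoints.
    """
    mean_uptime = history.mean_uptime()
    if mean_uptime > 2 * current:
        proposed = current * 1.5   # stable resource: checkpoint less often
    elif mean_uptime < current:
        proposed = current / 2     # unstable resource: checkpoint more often
    else:
        proposed = current
    # Keep the interval inside the (assumed) allowed range.
    return max(min_interval, min(max_interval, proposed))
```

A scheduler built along these lines would call `choose_strategy` each time a task is dispatched and `tune_interval` each time a checkpoint completes or a failure is recorded. The constants would need to be calibrated against the workload; the values shown here are placeholders.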