Proceedings of 2010 International Conference on Communication and Computational Intelligence

A hierarchical fault detection and recovery in a computational grid using watchdog timers


Abstract

Grid computing applies the resources of many networked computers to a single problem or task at the same time. The drawback is that the computers performing the calculations are not always trustworthy and may fail from time to time. The larger the number of nodes in the grid, the greater the probability that a node fails. Executing workflows reliably therefore requires fault tolerance and recovery strategies. This paper proposes a method that records an instantaneous snapshot of the local state of the processes within each node. An efficient algorithm is introduced for detecting node failures using watchdog timers. For recovery, a divide-and-conquer algorithm avoids redoing jobs that have already completed, which enables faster recovery.
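The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of watchdog-based failure detection combined with snapshot-aware recovery. All names here (`WatchdogMonitor`, `heartbeat`, `failed_nodes`, `jobs_to_redo`) and the timeout value are assumptions made for illustration, not the paper's design. The recovery helper is a plain set filter standing in for the paper's divide-and-conquer algorithm, which is not described in enough detail to reproduce.

```python
class WatchdogMonitor:
    """Hypothetical monitor holding one watchdog timer per node.

    A node resets its timer by sending a heartbeat. A node whose timer
    has not been reset within `timeout` seconds is reported as failed.
    Time is passed in explicitly so the behavior is deterministic.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reset = {}  # node id -> time of the last heartbeat

    def heartbeat(self, node, now):
        # Reset (or start) the watchdog timer for this node.
        self.last_reset[node] = now

    def failed_nodes(self, now):
        # A timer has expired if more than `timeout` seconds have passed
        # since the node last reset it.
        return sorted(
            node
            for node, last in self.last_reset.items()
            if now - last > self.timeout
        )


def jobs_to_redo(assigned_jobs, completed_in_snapshot):
    """Return the jobs of a failed node that still need to run.

    Assumption: the snapshot records which jobs had already completed,
    so only the remaining jobs are rescheduled. This is a simple filter,
    not the divide-and-conquer recovery described in the paper.
    """
    return [job for job in assigned_jobs if job not in completed_in_snapshot]


monitor = WatchdogMonitor(timeout=5.0)
monitor.heartbeat("n1", now=0.0)
monitor.heartbeat("n2", now=0.0)
monitor.heartbeat("n1", now=4.0)       # n1 resets its timer, n2 stays silent
expired = monitor.failed_nodes(now=6.0)
remaining = jobs_to_redo(["j1", "j2", "j3"], {"j1"})
```

In this sketch `expired` contains only `n2`, because `n1` reset its timer within the timeout window, and `remaining` skips `j1` because the snapshot already marks it as completed. In a hierarchical arrangement, one such monitor per level could watch the nodes directly below it, though that layout is also an assumption.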
