
Checkpointing and Recovery Mechanism in Grid


Abstract

A grid is a collection of distributed computing resources that work in coordination to achieve high-end computational capability by dividing a given task into sub-tasks. Each sub-task can be large and may run for several hours or days on a number of grid nodes. If a sub-task fails to complete, even at a single site, all the computations must be performed again. In scalable distributed systems, an individual component failure usually does not result in failure of the entire system. However, the probability that some component fails rises rapidly as the number of components in the system increases. As a system grows in size, an efficient recovery mechanism therefore becomes essential for the highly parallel, mission-critical, long-running applications of a grid environment. This paper addresses a recovery mechanism that uses checkpoints to recover from a Grid Service failure that causes a task or transaction to fail in a Computational or Data Grid, so that computations do not have to be restarted from scratch. This work helps preserve two main objectives of a grid, namely optimal resource utilization and speedy computation. Both are served by using resources to improve system performance rather than tying them up in work such as rollbacks caused by cascading aborts. The checkpointing mechanism is intended to recover from system failures that bring down running services and the computational tasks or transactions they are executing. The state saved in checkpoints can also be used for job migration by the grid's job schedulers.
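The abstract does not give the paper's actual mechanism, so the following is only a minimal sketch of the general checkpoint-and-restart idea it describes: a long-running sub-task periodically saves its state, and after a failure it resumes from the last saved state instead of starting over. The function names (`save_checkpoint`, `load_checkpoint`, `run_subtask`), the JSON file format, and the `fail_at` crash hook are all assumptions made for illustration and do not come from the paper.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to disk atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint so a crash never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def run_subtask(items, path, every=10, fail_at=None):
    """Sum `items`, saving a checkpoint after every `every` processed items.

    `fail_at` is a hypothetical hook that simulates a node crash at that index.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final checkpoint marks the sub-task complete
    return state["total"]
```

If a run over 100 items crashes at item 57 with `every=10`, the last checkpoint records `next_item` as 50, so a second call to `run_subtask` with the same checkpoint path redoes only items 50 to 99. Because the saved state is an ordinary file, a scheduler could in principle hand it to a different node to continue the work, which is the job-migration use the abstract mentions.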
