Conference: 2010 International Conference on High Performance Computing and Simulation

Using replication and checkpointing for reliable task management in computational Grids



Abstract

In large-scale Grid computing environments, fault tolerance is required for both scientific computation and file sharing to increase their reliability. Previous work has proposed several fault-tolerance mechanisms for Grids and distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use Grid resources inefficiently. The main goal of ART, the mechanism presented here, is to reduce the number of replications by applying a checkpointing and rollback scheme to each replication. In ART, the minimum number of replications is selected adaptively, based on an analysis of the probability that a task executes successfully within its given deadline and on the reliability requirement of each task. Our simulation results show that ART significantly reduces the number of replications and improves scalability compared with existing mechanisms.
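
The selection of a minimum number of replications can be illustrated with a small sketch. It is not the paper's analysis: it assumes that replications fail independently and that each one finishes within the deadline with the same probability `p_success`, so that at least one of `n` replications succeeds with probability 1 - (1 - p_success)^n. The function name `min_replicas`, its parameters, and the numbers in the usage note are hypothetical.

```python
def min_replicas(p_success: float, reliability: float) -> int:
    """Smallest n such that 1 - (1 - p_success)**n >= reliability.

    Assumption (not from the paper): replications fail independently and
    share the same per-replication probability of meeting the deadline.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")

    n = 1                          # at least one copy of the task must run
    p_all_fail = 1.0 - p_success   # probability that all n replications fail
    while 1.0 - p_all_fail < reliability:
        n += 1
        p_all_fail *= 1.0 - p_success
    return n
```

Under these assumptions, raising the per-replication success probability lowers the count: `min_replicas(0.5, 0.9)` needs more replications than `min_replicas(0.8, 0.9)`. This is the intuition behind combining checkpointing and rollback with replication, since a replication that can roll back to a checkpoint is more likely to finish within the deadline.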
