Using replication and checkpointing for reliable task management in computational Grids


Abstract

In large-scale Grid computing environments, fault tolerance is required for both scientific computation and file sharing to increase their reliability. Previous work proposed several mechanisms for Grids and distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use Grid resources inefficiently. The main goal of ART is to reduce the number of replications by applying a checkpointing and rollback scheme to each replication. In ART, the minimum number of replications is selected adaptively, based on an analysis of the probability of successful execution within the given deadline and on the reliability requirement of each task. Our simulation results show that ART can significantly reduce the number of replications and improve scalability compared with existing mechanisms.
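The abstract does not give ART's actual probability model, so the following is only a minimal sketch of the general idea of picking the smallest replica count that meets a reliability requirement. It assumes, hypothetically, that replicas fail independently and that each replica finishes within the deadline with the same probability `p_success`; the function name `min_replicas` and both parameters are illustrative and do not come from the paper.

```python
def min_replicas(p_success: float, reliability: float) -> int:
    """Smallest n such that at least one of n replicas succeeds
    with probability >= reliability.

    Assumption (not from the paper): replicas are independent and each
    one meets the deadline with the same probability p_success.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in [0, 1)")

    p_fail = 1.0 - p_success
    n = 1
    # P(at least one replica succeeds) = 1 - p_fail ** n
    while 1.0 - p_fail ** n < reliability:
        n += 1
    return n


if __name__ == "__main__":
    # A replica without checkpointing might meet the deadline half the time;
    # with checkpointing and rollback the per-replica probability may be higher.
    print(min_replicas(0.5, 0.99))
    print(min_replicas(0.8, 0.99))
```

Under these assumptions, anything that raises the per-replica success probability (such as checkpointing and rollback inside each replica) lowers the number of replicas needed for the same reliability target, which is the trade-off the abstract describes.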


Bibliographic details

Source: 2010 International Conference on High Performance Computing and Simulation, 2010, pp. 125-131 (7 pages)
Authors: Yi Sangho; Kondo Derrick; Kim Bongjae; Park Geunyoung; Cho Yookun
Original format: PDF
Chinese Library Classification: Automatic simulation theory
Keywords: Checkpointing; Computational Grids; Real-time; Reliability; Replication

