Journal: IEEE Transactions on Parallel and Distributed Systems

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

Abstract

A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in the loss or delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly used techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt these parameters based on information about grid status, to provide high job throughput in the presence of failures while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run with workload and system parameters derived from logs collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.
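
The abstract does not spell out the paper's heuristics, so the following is only an illustrative sketch of what "adapting the checkpointing interval and replica number to grid status" can look like, not the authors' method. It substitutes two textbook rules: Young's first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF), and the smallest replica count that reaches a target success probability under the assumption that replicas fail independently. The function names and parameters (`young_interval`, `replicas_needed`, `checkpoint_cost`, `mtbf`, `failure_prob`, `target_success`) are hypothetical and chosen here for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    checkpoint_cost: time to write one checkpoint (seconds).
    mtbf: currently observed mean time between failures (seconds).
    Assumption: this stands in for the paper's heuristics, which the
    abstract does not describe.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def replicas_needed(failure_prob: float, target_success: float) -> int:
    """Smallest n such that 1 - failure_prob**n >= target_success.

    Assumption: each replica fails independently with the same
    probability, which real grids only approximate.
    """
    if not 0.0 <= failure_prob < 1.0:
        raise ValueError("failure_prob must be in [0, 1)")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be in (0, 1)")
    n = 1
    while 1.0 - failure_prob ** n < target_success:
        n += 1
    return n


if __name__ == "__main__":
    # As the observed MTBF drops, the interval shrinks (checkpoint more often).
    for mtbf in (10_000.0, 2_500.0):
        print(f"MTBF={mtbf:>8.0f}s -> interval={young_interval(50.0, mtbf):.0f}s")
    # As the per-replica failure probability rises, more replicas are needed.
    for p in (0.1, 0.5):
        print(f"p_fail={p} -> replicas={replicas_needed(p, 0.9)}")
```

In an adaptive scheme of this kind, `mtbf` and `failure_prob` would be re-estimated from recent grid status before each scheduling decision, so that both parameters track the failure frequency instead of being fixed in advance.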
