Journal: Journal of Information Recording

Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems



Abstract

Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing that uses two levels of checkpoints. The first level is local checkpointing: computing nodes save their checkpoints to local disk. If a transient failure occurs in a computing node, the process can recover from local disk. The second level is global checkpointing: computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs in a computing node, that node can no longer be used, and the process recovers from global storage on a new computing node. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each local checkpointing interval, the system determines the expected recovery time in the case of a permanent failure and adaptively either takes a global checkpoint or skips it. Experimental results show that the average execution time of the NAS-BT application is significantly reduced by the proposed method. The maximum reduction in execution time for this application is 38%.
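The adaptive step described in the abstract can be illustrated with a small sketch. The abstract does not state the decision rule, so everything below is an assumption made for illustration: the names `TwoLevelState`, `should_take_global`, and `end_of_local_interval` are hypothetical, permanent failures are assumed to follow an exponential model, and the rule (take a global checkpoint when the expected rework after a permanent failure exceeds the cost of writing the checkpoint) is a stand-in, not the authors' method.

```python
import math
from dataclasses import dataclass


@dataclass
class TwoLevelState:
    """Hypothetical bookkeeping for one job (all fields are assumptions)."""
    local_interval: float          # seconds of computation between local checkpoints
    global_ckpt_cost: float        # seconds needed to write one global checkpoint
    permanent_failure_rate: float  # assumed permanent failures per second
    work_since_global: float = 0.0 # computation not yet covered by a global checkpoint


def should_take_global(state: TwoLevelState) -> bool:
    """Assumed rule: compare expected rework against the global checkpoint cost."""
    # Probability of a permanent failure during the next local interval,
    # under an assumed exponential failure model.
    p_fail = 1.0 - math.exp(-state.permanent_failure_rate * state.local_interval)
    # After a permanent failure the local checkpoint is lost, so all work done
    # since the last global checkpoint would have to be repeated.
    expected_rework = p_fail * state.work_since_global
    return expected_rework > state.global_ckpt_cost


def end_of_local_interval(state: TwoLevelState) -> str:
    """Called after each local checkpoint; returns 'global' or 'skip'."""
    state.work_since_global += state.local_interval
    if should_take_global(state):
        # A global checkpoint now covers all completed work.
        state.work_since_global = 0.0
        return "global"
    return "skip"


if __name__ == "__main__":
    state = TwoLevelState(local_interval=60.0,
                          global_ckpt_cost=30.0,
                          permanent_failure_rate=1e-4)
    for i in range(1, 101):
        if end_of_local_interval(state) == "global":
            print(f"global checkpoint after local interval {i}")
```

Under this assumed rule, global checkpoints are skipped while little work is at risk and are taken once the accumulated unprotected work makes a permanent failure costly enough, which mirrors the "take or skip" behavior the abstract describes.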
