Avoiding checkpoint contamination in parallel systems

Abstract

Checkpointing and rollback recovery is a very effective technique to tolerate faults, provided the application is able to recover from a previous checkpoint and proceed with a failure-free computation. However, this technique may fall short if the checkpoint files are somehow contaminated by errors. This paper presents two mechanisms that may be used to determine if a committed checkpoint is error-free or not. These techniques can be used simultaneously for error detection and failure recovery. Both of them are based on checkpoint duplication: one makes use of spatial redundancy while the other is based on temporal redundancy. We discuss the main problems and trade-offs that have to be dealt with to implement these techniques. We then present a performance study that clearly shows the pros and cons of each one. As far as we know, this paper presents the first implementation of these mechanisms in a standard parallel computing system.
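The abstract does not give implementation details, so the following is only a minimal sketch of how checkpoint duplication could be used to detect a contaminated checkpoint. It assumes that checkpoints are byte strings and that the computation is deterministic, so that two error-free executions of the same interval yield identical state. The names `digest`, `spatial_check`, `temporal_check` and `advance` are hypothetical and do not come from the paper.

```python
import hashlib
from typing import Callable


def digest(state: bytes) -> str:
    """Return a SHA-256 fingerprint of a serialized checkpoint."""
    return hashlib.sha256(state).hexdigest()


def spatial_check(replica_a: bytes, replica_b: bytes) -> bool:
    """Spatial redundancy (assumed form): two replicas of the process
    checkpoint at the same point; a mismatch signals contamination."""
    return digest(replica_a) == digest(replica_b)


def temporal_check(
    previous: bytes,
    step: Callable[[bytes], bytes],
    candidate: bytes,
) -> bool:
    """Temporal redundancy (assumed form): re-execute the interval from
    the previous checkpoint and compare the result with the candidate."""
    return digest(step(previous)) == digest(candidate)


def advance(state: bytes) -> bytes:
    """Hypothetical deterministic computation between two checkpoints."""
    return state + b"|next"


if __name__ == "__main__":
    start = b"state-0"
    good = advance(start)
    print(spatial_check(good, good))
    print(temporal_check(start, advance, b"corrupted"))
```

In this sketch the spatial variant needs a second copy of the process running at the same time, while the temporal variant needs no extra process but repeats the computation. This is one way to picture the trade-off between the two mechanisms that the abstract mentions.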
