Conference: ACM/IEEE conference on Supercomputing

A checkpointing strategy for scalable recovery on distributed parallel systems


Abstract

In this paper, we describe a new scheme for checkpointing parallel applications on scalable, message-passing, distributed-memory systems. The novelty of our scheme is that a checkpointed application can be restored from its checkpointed state in a reconfigured form. Thus, a parallel application may be checkpointed while executing with t1 tasks on p1 processors and then restarted from the checkpointed state with t2 tasks on p2 processors. As a result, applications can recover from partial failures in the underlying system. The reconfigurable checkpointed states can also be migrated from one parallel system to another, even if the two systems do not have the same number of processors. We describe a new programming model for implementing a reconfigurable checkpointing scheme for parallel programs. This model is derived from the DRMS programming model, which was developed in the context of run-time reconfiguration of parallel applications. A key component of our implementation is the distribution-independent representation of application array data structures in persistent storage. To further optimize the performance of checkpoint/restart operations, we provide parallel array section streaming operations for such distributed arrays. We present performance data for the reconfigurable checkpointing and restarting of parallel applications and compare it with the performance of conventional forms of checkpointing. Our results demonstrate the advantages of the new scheme we describe.
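The idea of a distribution-independent checkpoint can be illustrated with a minimal sketch. It is not the DRMS API or the authors' implementation: the function names (`block_bounds`, `distribute`, `checkpoint`, `restart`) are hypothetical, the array is one-dimensional, a simple block distribution is assumed, and an in-memory list stands in for persistent storage. The point it shows is that when the stored form is ordered by global index, nothing in it depends on the number of tasks that wrote it, so a different number of tasks can read it back.

```python
from typing import List, Tuple


def block_bounds(n: int, tasks: int, rank: int) -> Tuple[int, int]:
    """Half-open global index range [lo, hi) owned by `rank`
    when n elements are block-distributed over `tasks` tasks."""
    base, extra = divmod(n, tasks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def distribute(global_array: List[int], tasks: int) -> List[List[int]]:
    """Split a global array into one local piece per task."""
    n = len(global_array)
    pieces = []
    for rank in range(tasks):
        lo, hi = block_bounds(n, tasks, rank)
        pieces.append(global_array[lo:hi])
    return pieces


def checkpoint(local_pieces: List[List[int]]) -> List[int]:
    """Write every local piece at its global offset.
    The stored form is ordered by global index only, so it carries
    no trace of how many tasks produced it (hypothetical stand-in
    for a file in persistent storage)."""
    tasks = len(local_pieces)
    n = sum(len(piece) for piece in local_pieces)
    stored = [0] * n
    for rank, piece in enumerate(local_pieces):
        lo, hi = block_bounds(n, tasks, rank)
        stored[lo:hi] = piece
    return stored


def restart(stored: List[int], tasks: int) -> List[List[int]]:
    """Each of the new tasks reads only the array section it now owns."""
    return distribute(stored, tasks)


# Checkpoint with t1 = 4 tasks, restart with t2 = 3 tasks.
data = list(range(10))
pieces_t1 = distribute(data, 4)
stored = checkpoint(pieces_t1)
pieces_t2 = restart(stored, 3)
```

In a real system each task would write and read its own section in parallel, which is the role the abstract assigns to the parallel array section streaming operations; the sequential loop here only illustrates the index mapping.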
