




Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost.

Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions. Sometimes ease of use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are and how they were made in CLIP.

We also present performance data from checkpointing several long-running Paragon applications with CLIP.

The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault tolerance on a massively parallel multicomputer like the Paragon with very good performance.


