Journal: Computer architecture news

Application-level Checkpointing for Shared Memory Programs



Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
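To make the CPR idea concrete, the following is a minimal sketch of application-level checkpointing for a shared-memory computation. It is not the paper's pre-compiler output or its coordination protocol: Python threads stand in for OpenMP threads, a simple two-barrier scheme stands in for the runtime protocol, and every name here (`run`, `save_checkpoint`, `load_checkpoint`, `fail_at`, the JSON checkpoint file) is an assumption made for illustration.

```python
import json
import os
import tempfile
import threading


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over the old checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path, num_threads):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "partial": [0] * num_threads}


def run(path, num_threads=4, total_steps=20, interval=5, fail_at=None):
    """Run or resume the computation; fail_at (hypothetical) simulates a crash."""
    state = load_checkpoint(path, num_threads)
    start = state["step"]
    partial = state["partial"]  # shared memory: one slot per thread
    stop = total_steps if fail_at is None else fail_at
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        for step in range(start, stop):
            partial[tid] += (step + 1) * (tid + 1)  # stand-in for real work
            if (step + 1) % interval == 0:
                barrier.wait()  # every thread has finished this step
                if tid == 0:
                    save_checkpoint(path, {"step": step + 1, "partial": partial})
                barrier.wait()  # nobody resumes until the state is on disk

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    if fail_at is not None:
        return None  # simulated failure: in-memory state is discarded
    return sum(partial)
```

Calling `run(path, fail_at=12)` and then `run(path)` resumes from the checkpoint taken at step 10 and should produce the same result as an uninterrupted run. Blocking all threads at a barrier is the simplest way to get a consistent snapshot; a real system for programs with locks would need a more careful protocol than this sketch provides.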
