Application-level Checkpointing for Shared Memory Programs

Greg Bronevetsky; Daniel Marques; Keshav Pingali; Peter Szwed; Martin Schulz

Abstract

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
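
The checkpoint-and-restart idea described in the abstract can be illustrated with a minimal single-threaded sketch. This is not the paper's pre-compiler or its thread-coordination protocol; it only shows the basic pattern of saving state periodically and resuming from the last saved state. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter, and the use of `pickle` are assumptions made for illustration.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write state to disk; rename at the end so a crash mid-write
    leaves the previous checkpoint intact."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for demonstration) simulates a hardware
    fault by raising an exception when that step is reached.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same checkpoint path after a simulated fault resumes from the last saved step rather than from step 0. Work done since the last checkpoint is repeated, which is the usual trade-off when choosing the checkpoint interval.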

Bibliographic details

Source: Computer Architecture News, 2004, No. 5, pp. 235-247 (13 pages)
Authors: Greg Bronevetsky; Daniel Marques; Keshav Pingali; Peter Szwed; Martin Schulz
Affiliation: Department of Computer Science, Cornell University, Ithaca, NY 14853
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: fault-tolerance; checkpointing; shared-memory programs; OpenMP
