Checkpoint/restart approaches for a thread-based MPI runtime

Adam Julien; Kermarquer Maxime; Besnard Jean-Baptiste; Bautista-Gomez Leonardo; Perache Marc; Carribault Patrick; Jaeger Julien; Malony Allen D.; Shende Sameer

Abstract

Fault tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and the associated overhead and trade-offs are discussed. (C) 2019 Elsevier B.V. All rights reserved.

Bibliographic details

Source
Parallel Computing | 2019, No. 7 | pp. 204-219 | 16 pages
Authors
Adam Julien; Kermarquer Maxime; Besnard Jean-Baptiste; Bautista-Gomez Leonardo; Perache Marc; Carribault Patrick; Jaeger Julien; Malony Allen D.; Shende Sameer;
Author affiliations

ParaTools SAS, Bruyeres Le Chatel, France;

CEA, DAM, DIF, F-91297 Arpajon, France;

ParaTools SAS, Bruyeres Le Chatel, France;

Barcelona Supercomp Ctr, Barcelona, Spain;

CEA, DAM, DIF, F-91297 Arpajon, France;

CEA, DAM, DIF, F-91297 Arpajon, France;

CEA, DAM, DIF, F-91297 Arpajon, France;

ParaTools Inc, Eugene, OR USA;

ParaTools Inc, Eugene, OR USA;

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Text language: English
Keywords
Checkpoint-restart; Fault-tolerance; DMTCP; Infiniband; Multilevel checkpointing; MPI oversubscribing;

