Journal: Parallel Computing

Checkpoint/restart approaches for a thread-based MPI runtime

Abstract

Fault tolerance has always been an important topic when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems that gather millions of computing units. Moreover, the larger a job is, the more computing hours are wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and the associated overhead and trade-offs are discussed. (C) 2019 Elsevier B.V. All rights reserved.
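
To make the idea of application-level checkpointing concrete, here is a minimal single-process sketch. It is not the runtime described in the abstract and uses no MPI: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` knob that simulates a crash, and the small 1D smoothing loop (loosely in the spirit of a heat-diffusion kernel such as Heatdis) are all illustrative assumptions. The application decides what state to save, writes it periodically, and resumes from the last saved iteration after a failure.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces any older checkpoint in one step, so a crash
    # during the write leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def step(grid):
    """One Jacobi-style smoothing pass over a 1D grid with fixed end values."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    return new


def run(path, total_iters, interval, fail_at=None):
    """Iterate to total_iters, checkpointing every `interval` iterations.

    fail_at is a hypothetical knob that simulates a crash by raising.
    """
    initial = {"iter": 0, "grid": [100.0] + [0.0] * 8 + [100.0]}
    state = load_checkpoint(path) or initial
    while state["iter"] < total_iters:
        if fail_at is not None and state["iter"] == fail_at:
            raise RuntimeError("simulated failure")
        state = {"iter": state["iter"] + 1, "grid": step(state["grid"])}
        if state["iter"] % interval == 0:
            save_checkpoint(path, state)
    return state
```

Calling `run(path, 20, 5, fail_at=7)` raises after a checkpoint has been written at iteration 5; calling `run(path, 20, 5)` afterwards resumes from that checkpoint, so only the iterations since the last checkpoint are recomputed. A transparent approach would instead snapshot the whole process without the application naming its state.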