Journal: Parallel Computing

Checkpoint/restart approaches for a thread-based MPI runtime


Abstract

Fault tolerance has always been an important topic when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and the associated overhead and trade-offs are discussed. (C) 2019 Elsevier B.V. All rights reserved.
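
To make the idea of application-level checkpointing concrete, the following is a minimal single-process sketch and not the paper's implementation. It assumes a toy 1D heat-diffusion loop (loosely in the spirit of Heatdis) and uses no MPI at all. The names `save_checkpoint`, `load_checkpoint`, `heat_step`, `run` and the `crash_at` parameter are hypothetical and exist only for illustration.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a reader never sees a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def heat_step(grid):
    """One Jacobi step of 1D heat diffusion with fixed boundary values."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    return new


def run(path, total_steps, interval, crash_at=None):
    """Iterate, checkpointing every `interval` steps; resume if a checkpoint exists.

    `crash_at` is a hypothetical hook that simulates a failure at a given step.
    """
    state = load_checkpoint(path)
    if state is None:
        state = {"step": 0, "grid": [100.0] + [0.0] * 8 + [100.0]}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["grid"] = heat_step(state["grid"])
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["grid"]
```

Calling `run` again with the same path after a failure resumes from the last saved step rather than from step 0, so only the work done since that checkpoint is lost. The checkpoint interval trades I/O overhead against the amount of recomputation after a crash. A transparent mechanism, by contrast, would capture process state without the application calling a save routine itself.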
