Conference: International Conference ISC High Performance: International Conference on High Performance Computing
Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Abstract

Scaling supercomputers comes with an increase in failure rates due to the growing number of hardware components. In standard practice, applications are made resilient by checkpointing data and, after a failure occurs, restarting execution to resume from the latest checkpoint. However, re-deploying an application incurs overhead from tearing down and re-instating execution, and may force checkpoints to be retrieved from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. To derive new insight on performance, we extensively evaluate Reinit++ against the leading MPI fault-tolerance approach, ULFM, used to implement global-restart recovery, and against the typical practice of restarting an application. Experimentation with three different HPC proxy applications, made resilient to withstand process and node failures, shows that Reinit++ recovers much faster than restarting (up to 6×) or ULFM (up to 3×), and that it scales excellently as the number of MPI processes grows.