Journal: Concurrency, practice and experience

EREINIT: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Abstract

Scientists from many different fields have been developing Bulk-Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to decrease the time of failure recovery in Bulk-Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches, in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.
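The global-restart idea mentioned in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions: it uses no MPI and is not EReinit's interface. The names `ProcessFailure`, `run_with_global_restart`, `fail_at`, and `checkpoint_every` are hypothetical, and the in-memory checkpoint stands in for whatever checkpointing a real application would use. It shows a bulk-synchronous loop that, on a failure, rolls back to the last checkpoint and re-enters from its starting point instead of aborting the whole job.

```python
class ProcessFailure(Exception):
    """Hypothetical stand-in for a failure notification seen by all ranks."""


def run_with_global_restart(total_steps, fail_at, checkpoint_every=2):
    """Simulate a bulk-synchronous loop that restarts from its entry point."""
    checkpoint = {"step": 0, "state": 0}  # last saved state (in memory, for illustration)
    pending_failures = set(fail_at)       # steps at which one failure is injected
    restarts = 0

    while True:
        # Re-entry point: every restart reloads the last checkpoint here.
        step, state = checkpoint["step"], checkpoint["state"]
        try:
            while step < total_steps:
                if step in pending_failures:
                    pending_failures.discard(step)  # each failure fires once
                    raise ProcessFailure(f"failure injected at step {step}")
                state += step   # compute phase of the superstep
                step += 1       # synchronization point closes the superstep
                if step % checkpoint_every == 0:
                    checkpoint = {"step": step, "state": state}
            return state, restarts
        except ProcessFailure:
            restarts += 1       # roll back and re-enter from the checkpoint


result, restarts = run_with_global_restart(total_steps=6, fail_at=[3, 5])
print(result, restarts)
```

The run with two injected failures produces the same final state as a failure-free run, at the cost of recomputing the supersteps since the last checkpoint. How quickly the real system detects the failure, notifies all processes, and reinitializes MPI is what the abstract says EReinit optimizes; none of that is modeled here.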