Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

Abstract

Fault Tolerant MPI (FT-MPI) was designed to give applications ways of handling process failures that go beyond simple checkpoint/restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system-state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. Although robust, the algorithm and its implementation were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that aim to be both scalable and latency tolerant. Our conclusions show that combining topology-aware collective communication with distributed consensus algorithms produces the best results.