Journal: Concurrency and Computation: Practice and Experience

Tree-based fault-tolerant collective operations for MPI


Abstract

As high-performance computing systems grow in size and complexity, both the probability of failures and the cost of recovery increase. Parallel applications running on these systems should be able to continue running despite node failures at arbitrary times. Collective operations are essential for many parallel MPI applications and are often the first to detect such failures. This work presents tree-based fault-tolerant collective operations, which combine fault detection and recovery as an integral part of each operation. We do this by extending existing tree-based algorithms so that a collective operation can succeed even when nodes fail before or during its run. This differs from other approaches, where recovery takes place after such an operation has failed. The article includes a performance comparison between the proposed algorithm and other approaches, as well as a simulator-based analysis of performance at scale.
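To make the idea of a tree-based collective that routes around failed nodes more concrete, here is a minimal Python sketch. It is not the algorithm from the article: it assumes a binary tree rooted at rank 0, a root that never fails, and a set of failed ranks fixed before the run, and it models failure detection as a simple set lookup. The names `children` and `ft_broadcast` are hypothetical and chosen only for this illustration.

```python
def children(rank, size):
    """Children of `rank` in a binary tree rooted at rank 0 (assumed layout)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def ft_broadcast(size, failed, value):
    """Simulate a broadcast from rank 0 that routes around failed ranks.

    `failed` is the set of dead ranks (assumption: known to be dead when a
    sender tries to reach them). Returns a dict mapping each live rank to
    the value it received.
    """
    if 0 in failed:
        raise ValueError("this sketch assumes the root stays alive")
    received = {0: value}
    pending = [0]  # live ranks that still have to forward the value
    while pending:
        sender = pending.pop()
        targets = children(sender, size)
        while targets:
            target = targets.pop()
            if target in failed:
                # Failure detected while sending: the sender adopts the
                # dead rank's children and sends to them directly.
                targets.extend(children(target, size))
            else:
                received[target] = value
                pending.append(target)
    return received
```

For example, `ft_broadcast(7, {1}, "x")` still delivers the value to ranks 2, 3, 4, 5 and 6, because rank 0 takes over the children of the failed rank 1. Handling failures that occur during the operation, as the abstract describes, would need additional machinery that this sketch leaves out.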
