Conference: International Symposium on Parallel Distributed Processing

Scalable Group-based Checkpoint/Restart for Large-Scale Message-passing Systems


Abstract

The ever-increasing number of processors in parallel computers makes fault-tolerance support in large-scale parallel systems increasingly important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as systems scale up, and analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. We then propose a group-based solution that combines coordinated checkpointing with message logging. Experimental results show that it performs and scales better than LAM/MPI and MPICH-VCL. To assist group formation, we also propose a method for analyzing the application's communication behavior.
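The abstract does not say how groups are derived from the communication analysis, so the following is only an illustrative sketch, not the authors' method. It assumes a hypothetical input `traffic` mapping each pair of ranks to the number of bytes they exchange, and a hypothetical cap `max_size` on group size. It greedily merges the heaviest-communicating pairs into the same group, on the assumption that ranks inside a group would checkpoint in a coordinated way while only messages crossing group boundaries would need logging. The names `form_groups` and `logged_bytes` are invented for this example.

```python
from typing import Dict, List, Tuple

Traffic = Dict[Tuple[int, int], int]  # (rank_a, rank_b) -> bytes exchanged (assumed input)


def form_groups(traffic: Traffic, n_ranks: int, max_size: int) -> List[List[int]]:
    """Greedy grouping heuristic (an assumption, not the paper's algorithm):
    visit rank pairs from heaviest to lightest traffic and merge their groups
    whenever the merged group would not exceed max_size."""
    parent = list(range(n_ranks))  # union-find parent pointers
    size = [1] * n_ranks           # group size, valid at each root

    def find(x: int) -> int:
        # Find the group root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Heaviest pairs first; ties broken by rank pair for determinism.
    for (a, b), _volume in sorted(traffic.items(), key=lambda kv: (-kv[1], kv[0])):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_size:
            parent[rb] = ra
            size[ra] += size[rb]

    groups: Dict[int, List[int]] = {}
    for rank in range(n_ranks):
        groups.setdefault(find(rank), []).append(rank)
    return sorted(groups.values())


def logged_bytes(traffic: Traffic, groups: List[List[int]]) -> int:
    """Bytes that cross group boundaries, i.e. the traffic that would
    have to be logged under the assumption stated above."""
    group_of = {rank: i for i, group in enumerate(groups) for rank in group}
    return sum(v for (a, b), v in traffic.items() if group_of[a] != group_of[b])


if __name__ == "__main__":
    # Made-up traffic for four ranks, purely for illustration.
    demo = {(0, 1): 100, (2, 3): 90, (1, 2): 5, (0, 3): 1}
    demo_groups = form_groups(demo, n_ranks=4, max_size=2)
    print(demo_groups, logged_bytes(demo, demo_groups))
```

With the made-up traffic above, ranks 0 and 1 end up in one group and ranks 2 and 3 in another, leaving 6 bytes of inter-group traffic. A smaller `max_size` shrinks the set of processes that must coordinate at each checkpoint but tends to increase the volume that crosses group boundaries, which is the trade-off a group-based scheme has to balance.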
