Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols

Journal: Future Generation Computer Systems

Abstract

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches have been proposed, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalabilities remain unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalabilities. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS parallel benchmarks.