Journal: Concurrency, practice and experience

The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints


Abstract

Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large-scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next-generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
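
The scheduling idea in the abstract, deferring each checkpoint until a collective operation has just completed, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation and not part of any MPI library: the class `ApproxCoordinatedCheckpointer`, the hook `on_collective_complete`, and the helper `simulate` are hypothetical names, and the rule "checkpoint at the first collective that completes once the nominal interval has elapsed" is an assumed reading of the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApproxCoordinatedCheckpointer:
    """Hypothetical policy: a checkpoint may only happen right after a collective."""

    interval: float                 # nominal checkpoint interval (assumed timer-based)
    last_checkpoint: float = 0.0    # time of the most recent checkpoint
    taken_at: List[float] = field(default_factory=list)

    def on_collective_complete(self, now: float) -> bool:
        """Hook assumed to run immediately after a collective operation returns.

        Takes a checkpoint only if at least `interval` has elapsed since the
        previous one; otherwise the process keeps computing.
        """
        if now - self.last_checkpoint >= self.interval:
            self.taken_at.append(now)
            self.last_checkpoint = now
            return True
        return False


def simulate(collective_times: List[float], interval: float) -> List[float]:
    """Replay the completion times of collectives and return the checkpoint times."""
    policy = ApproxCoordinatedCheckpointer(interval=interval)
    for t in collective_times:
        policy.on_collective_complete(t)
    return policy.taken_at


if __name__ == "__main__":
    # Collectives complete at these (made-up) times; the nominal interval is 10.
    print(simulate([3, 7, 12, 13, 21, 30], interval=10))
```

Because every process in the communicator leaves the same collective at roughly the same time, checkpoints taken at this hook are approximately aligned across processes without any extra coordination protocol. The cost is that the effective interval between checkpoints can exceed the nominal one, which is the effect the abstract reports as small for several workloads.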