Conference: IEEE International Symposium on Parallel and Distributed Processing

Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems


Abstract

The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to a few tens of minutes, making the crash of a processor the common case rather than a rarity. Parallel applications running on those large machines will need to simultaneously survive crashes and maintain high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating some form of message logging will provide a framework where only a subset of processors is rolled back after a crash. In this paper, we discuss why a simple causal message logging protocol seems a promising alternative for providing fault tolerance in large supercomputers. As opposed to pessimistic message logging, it has low latency overhead, especially in collective communication operations. In addition, it saves messages when more than one thread is running per processor. Finally, we demonstrate that a simple causal message logging protocol has faster recovery and a low performance penalty when compared to checkpoint/restart. Running NAS Parallel Benchmarks (CG, MG, BT and DT) on 1024 processors, simple causal message logging has a latency overhead below 5%.
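
To make the idea of rolling back only a subset of processors concrete, the following is a minimal sketch of generic causal message logging, not the protocol evaluated in the paper. It assumes that senders keep message payloads in memory, that each receiver records a determinant (the delivery order of a message), and that determinants are piggybacked on later outgoing messages so that some surviving process holds them. All names here (`Determinant`, `Process`, `replay_plan`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Records the order in which a receiver delivered one message."""
    sender: int
    ssn: int       # sender sequence number
    receiver: int
    rsn: int       # receive sequence number (delivery order at the receiver)


@dataclass
class Process:
    pid: int
    ssn: int = 0
    rsn: int = 0
    sent_log: dict = field(default_factory=dict)     # (dest pid, ssn) -> payload
    unacked: list = field(default_factory=list)      # own determinants not yet stored elsewhere
    remote_dets: list = field(default_factory=list)  # determinants held on behalf of others

    def send(self, dest, payload):
        self.ssn += 1
        # Sender-based logging: keep the payload so it can be resent later.
        self.sent_log[(dest.pid, self.ssn)] = payload
        # Piggyback determinants that no other process is known to hold yet.
        dest.receive(self, self.ssn, payload, list(self.unacked))

    def receive(self, src, ssn, payload, piggyback):
        self.remote_dets.extend(piggyback)  # store the sender's determinants
        src.ack(piggyback)                  # sender may stop piggybacking them
        self.rsn += 1
        self.unacked.append(Determinant(src.pid, ssn, self.pid, self.rsn))

    def ack(self, dets):
        self.unacked = [d for d in self.unacked if d not in dets]


def replay_plan(crashed_pid, survivors):
    """Rebuild the delivery order of a crashed process from surviving processes."""
    dets = {d for p in survivors for d in p.remote_dets if d.receiver == crashed_pid}
    by_pid = {p.pid: p for p in survivors}
    plan = []
    for d in sorted(dets, key=lambda d: d.rsn):
        payload = by_pid[d.sender].sent_log[(crashed_pid, d.ssn)]
        plan.append((d.rsn, d.sender, payload))
    return plan


# Hypothetical scenario: p1 receives from p0 and p2, sends once, then crashes.
p0, p1, p2 = Process(0), Process(1), Process(2)
p0.send(p1, "a")
p2.send(p1, "b")
p1.send(p2, "c")  # carries p1's two determinants to p2
plan = replay_plan(1, [p0, p2])
print(plan)
```

In this toy scenario only the crashed process needs to replay its receives; the survivors supply the logged payloads and determinants and keep their own state. A determinant that was never piggybacked before the crash is lost, which is acceptable under the causal approach because no surviving process can depend on that delivery.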
