Conference: IEEE International Parallel and Distributed Processing Symposium

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Abstract

High-performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message-passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for the inter-cluster messages confines the consequences of a failure to a single cluster, while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message-passing applications that provides failure containment without logging any event, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
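The clustering idea in the abstract can be illustrated with a minimal sketch. It is not HydEE's implementation and uses no MPI calls: the names `Process`, `send`, and `ranks_to_roll_back`, and the two-cluster layout, are hypothetical and chosen only for illustration. It shows the two properties the abstract names: message payloads are logged only when they cross a cluster boundary, and a failure rolls back only the failed process's cluster. It records no delivery events, and it does not model the recovery protocol itself.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical stand-in for one application process."""
    rank: int
    cluster: int
    # Sender-side, in-memory copies of inter-cluster messages: (dest rank, payload).
    sender_log: list = field(default_factory=list)


def send(src, dst, payload, network):
    """Deliver a message; keep a sender-side copy only if it crosses clusters."""
    if src.cluster != dst.cluster:
        src.sender_log.append((dst.rank, payload))
    network.append((src.rank, dst.rank, payload))


def ranks_to_roll_back(failed, processes):
    """Failure containment: only the failed process's cluster restarts."""
    return sorted(p.rank for p in processes if p.cluster == failed.cluster)


# Assumed layout: 4 ranks, 2 clusters (ranks 0-1 and ranks 2-3).
procs = [Process(rank=r, cluster=r // 2) for r in range(4)]
network = []
send(procs[0], procs[1], "a", network)  # intra-cluster: not logged
send(procs[1], procs[2], "b", network)  # inter-cluster: logged by rank 1
send(procs[3], procs[2], "c", network)  # intra-cluster: not logged
send(procs[2], procs[0], "d", network)  # inter-cluster: logged by rank 2

logged = sum(len(p.sender_log) for p in procs)
rollback = ranks_to_roll_back(procs[2], procs)
print(f"logged {logged} of {len(network)} messages; roll back ranks {rollback}")
```

In this toy run half of the messages are logged; how small that fraction is in practice depends on how well the clustering matches the application's communication pattern.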
