Conference: 2012 IEEE 26th International Parallel and Distributed Processing Symposium

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications



Abstract

High-performance computing will probably reach exascale in this decade. At this scale, the mean time between failures is expected to be a few hours. Existing fault-tolerant protocols for message-passing applications will no longer be efficient, since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault-tolerant protocols overcome these limits by dividing application processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters with message logging for the inter-cluster messages confines the consequences of a failure to a single cluster while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure-free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message-passing applications that provides failure containment without logging any event, while logging only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure-free execution.
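The hybrid idea described in the abstract (log only inter-cluster messages, and confine rollback to the failed process's cluster) can be illustrated with a toy model. This is only a sketch under stated assumptions, not HydEE's algorithm or its MPICH2 implementation: the abstract gives no algorithmic details, so the class `HybridLogger`, its methods, the rank-to-cluster mapping, and the in-memory log are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HybridLogger:
    """Toy model of a hybrid protocol (hypothetical, not HydEE itself).

    Processes are grouped into clusters; only messages that cross a
    cluster boundary are kept in the log.
    """
    cluster_of: dict[int, int]  # rank -> cluster id (assumed mapping)
    log: list[tuple[int, int, bytes]] = field(default_factory=list)

    def send(self, src: int, dst: int, payload: bytes) -> bool:
        """Log the message only if it is inter-cluster; return True if logged."""
        crosses = self.cluster_of[src] != self.cluster_of[dst]
        if crosses:
            self.log.append((src, dst, payload))
        return crosses

    def ranks_to_roll_back(self, failed_rank: int) -> set[int]:
        """Failure containment: only the failed rank's cluster restarts."""
        failed_cluster = self.cluster_of[failed_rank]
        return {r for r, c in self.cluster_of.items() if c == failed_cluster}

    def messages_to_replay(self, failed_rank: int) -> list[tuple[int, int, bytes]]:
        """Logged messages destined for the cluster that is rolled back."""
        restarting = self.ranks_to_roll_back(failed_rank)
        return [m for m in self.log if m[1] in restarting]


# Four ranks in two clusters: {0, 1} and {2, 3}.
logger = HybridLogger({0: 0, 1: 0, 2: 1, 3: 1})
logger.send(0, 1, b"intra")   # same cluster: not logged
logger.send(1, 2, b"inter")   # crosses clusters: logged
logger.send(3, 0, b"back")    # crosses clusters: logged
```

In this toy setup, a failure of rank 2 rolls back only ranks 2 and 3, and the single logged message addressed to that cluster is the one that would need to be replayed. How a real protocol orders replayed messages without event logging is exactly what the paper addresses and is not modeled here.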
