Journal: Concurrency and Computation

Correlated set coordination in fault tolerant message logging protocols for many-core clusters

Abstract

With our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint/restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
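
The central idea of the abstract, logging payloads only between independent processes, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a "correlated set" is simply the set of ranks on one node, that an event record is kept for every message, and that the payload is copied only when sender and receiver sit on different nodes. The names `Message`, `SenderLog`, `node_of`, and `record_send` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Message:
    src: int        # sender rank
    dst: int        # receiver rank
    seq: int        # per-sender sequence number
    payload: bytes


@dataclass
class SenderLog:
    # Hypothetical mapping from rank to node id; ranks on the same node
    # are assumed to form one correlated set.
    node_of: Dict[int, int]
    events: List[Tuple[int, int, int]] = field(default_factory=list)
    payloads: Dict[Tuple[int, int, int], bytes] = field(default_factory=dict)

    def correlated(self, a: int, b: int) -> bool:
        """True when both ranks are on the same node (assumed to fail together)."""
        return self.node_of[a] == self.node_of[b]

    def record_send(self, msg: Message) -> bool:
        """Log the event; log the payload only across correlated sets.

        Returns True when the payload was copied into the log.
        """
        key = (msg.src, msg.dst, msg.seq)
        # Assumption: the event record is kept for every message.
        self.events.append(key)
        if self.correlated(msg.src, msg.dst):
            # Same correlated set: recovery is coordinated, so the payload
            # does not need to be replayed from a log.
            return False
        self.payloads[key] = msg.payload
        return True
```

With ranks 0 and 1 placed on one node and rank 2 on another, a message from 0 to 1 adds only an event record, while a message from 0 to 2 also stores its payload. The saving grows with the share of traffic that stays inside a node, which is the many-core case the abstract targets.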
