Correlated set coordination in fault tolerant message logging protocols for many-core clusters

Aurelien Bouteiller; Thomas Herault; George Bosilca; Jack J. Dongarra


Abstract

Exascale systems are currently expected to comprise hundreds of thousands of many-core nodes, so the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint/restart techniques, is the most challenged when the number of cores per node increases, because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes.
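The logging rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Process` class, the `deliver` function, and the choice to treat "same node" as the correlated set are assumptions made here for illustration only. The sketch records a delivery event for every message but stores the payload only when the message crosses a correlated-set boundary.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    # Hypothetical model: processes sharing a node name form one correlated set.
    rank: int
    node: str
    event_log: list = field(default_factory=list)    # delivery events seen by this process
    payload_log: list = field(default_factory=list)  # payloads kept by this process as sender


def deliver(sender: Process, receiver: Process, seq: int, payload: bytes) -> bool:
    """Log one message; return True if its payload had to be logged."""
    # The delivery event is always recorded, so the scheme stays an event logging protocol.
    receiver.event_log.append((sender.rank, seq))
    # Payload is logged only between independent (different-node) processes.
    crosses_sets = sender.node != receiver.node
    if crosses_sets:
        sender.payload_log.append((receiver.rank, seq, payload))
    return crosses_sets


p0 = Process(rank=0, node="n0")
p1 = Process(rank=1, node="n0")  # same node as p0: coordinated, no payload copy
p2 = Process(rank=2, node="n1")  # different node: payload is logged

intra = deliver(p0, p1, seq=0, payload=b"a")
inter = deliver(p0, p2, seq=1, payload=b"b")
```

In this toy run only the message from rank 0 to rank 2 adds an entry to the sender's payload log, while both receivers record a delivery event.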


Bibliographic details

Source
Concurrency and Computation, 2013, Issue 4, pp. 572-585 (14 pages)
Authors
Aurelien Bouteiller; Thomas Herault; George Bosilca; Jack J. Dongarra
Affiliations

Innovative Computing Laboratory, 1122 Volunteer Blvd., 37996 Knoxville, TN, USA;

Innovative Computing Laboratory, The University of Tennessee, USA;

Innovative Computing Laboratory, The University of Tennessee, USA;

Innovative Computing Laboratory, The University of Tennessee, USA;

Indexing information
Original format: PDF
Language: English
Keywords: fault tolerance; multicore clusters; checkpoint/restart

Similar literature

1. Fault-tolerant adaptive routing under an unconstrained set of node and link failures for many-core systems-on-chip [J]. Michael Dimopoulos, Yi Gang, Lorena Anghel, Microprocessors and Microsystems. 2014, Issue 6
2. Message-optimal protocols for fault-tolerant broadcasts/multicasts in distributed systems with crash failures [J]. Hong-Yi Tzeng, Kai-Yeung Siu, IEEE Transactions on Computers. 1995, Issue 2
3. Hybrid Message Pessimistic Logging. Improving current pessimistic message logging protocols [J]. Hugo Meyer, Ronal Muresano, Marcela Castro-León, Journal of Parallel and Distributed Computing. 2017, Issue JUNa
4. Correlated Set Coordination in Fault Tolerant Message Logging Protocols [C]. Aurelien Bouteiller, Thomas Herault, George Bosilca, International Euro-Par Conference; Euro-Par 2011. 2011
5. QoS and fault tolerant distributed channel allocation protocols for wireless and mobile networks [D]. Abrougui, Kaouther. 2006
6. Cluster-Fault Tolerant Routing in a Torus [O]. Antoine Bossard, Keiichi Kaneko. 2020
7. Correlated set coordination in fault tolerant message logging protocols [O]. Aurelien Bouteiller, Thomas Herault, George Bosilca, 2011
