Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems

Abstract

The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to a few tens of minutes, making the crash of a processor the common case rather than a rarity. Parallel applications running on those large machines will need to survive crashes while maintaining high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating some form of message logging will provide a framework where only a subset of processors is rolled back after a crash. In this paper, we discuss why a simple causal message logging protocol seems a promising alternative for providing fault tolerance in large supercomputers. As opposed to pessimistic message logging, it has low latency overhead, especially in collective communication operations. In addition, it saves messages when more than one thread is running per processor. Finally, we demonstrate that a simple causal message logging protocol has faster recovery and a low performance penalty when compared to checkpoint/restart. Running NAS Parallel Benchmarks (CG, MG, BT and DT) on 1024 processors, simple causal message logging has a latency overhead below 5%.

Bibliographic details

Source
IEEE International Symposium on Parallel and Distributed Processing | 2011 | 8 pages
Authors
Esteban Meneses; Greg Bronevetsky; Laxmikant V. Kale
Original format: PDF
CLC classification: TP311-53
Keywords
Causal message logging; Pessimistic message logging; Migratable objects; Parallel applications

Similar literature

1. Adaptive fuzzy finite-time fault-tolerant control for switched nonlinear large-scale systems with actuator and sensor faults [J]. Zhang Jing, Li Shi, Xiang Zhengrong. Journal of the Franklin Institute. 2020, no. 16
2. Decentralized fault-tolerant MRAC for a class of large-scale systems with time-varying delays and actuator faults [J]. Deng Chao, Yang Guang-Hong, Er Meng Joo. Journal of Process Control. 2019
3. Adaptive Fault-tolerant Neural Control for Large-scale Systems with Actuator Faults [J]. Gong Jian-Ye, Jiang Bin, Shen Qi-Kun. International Journal of Control, Automation, and Systems. 2019, no. 6
4. Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems [C]. Meneses Esteban, Bronevetsky Greg, Kale Laxmikant V. 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 2011
5. Performance Evaluation of Fault-Tolerant Systems [D]. Huang, Kun. 2012
6. Trust-based fault detection and robust fault-tolerant control of uncertain cyber-physical systems against time-delay injection attacks [O]. Salman Baromand, Amirreza Zaman, Lyudmila Mihaylova. 2021
7. Evaluation of simple causal message logging for large-scale fault tolerant hpc systems [O]. Esteban Meneses, Greg Bronevetsky, Laxmikant V. Kalé. 2011
