Reverse computation for rollback-based fault tolerance in large parallel systems

Kalyan S. Perumalla; Alfred J. Park

Abstract

Reverse computation is presented here as an important future direction in addressing the challenge of fault-tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer from scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing, using measurements such as cache miss ratios, TLB misses and memory usage, indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
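As a rough illustration of the distinction the abstract draws, the sketch below contrasts the two rollback strategies on a toy particle model. It is not the authors' implementation: the ring-shaped integer domain, the `Particle` type, and all function names are assumptions made for this example, and integer arithmetic is used so that the reverse step undoes the forward step exactly. Checkpointing restores an earlier state from a saved copy, while reverse computation recovers it by running the inverse of each forward step and so needs no stored copy.

```python
from dataclasses import dataclass
from typing import List

RING = 1000  # hypothetical domain size: integer cells arranged on a ring


@dataclass
class Particle:
    pos: int  # cell index in [0, RING)
    vel: int  # cells moved per step (may be negative)


def forward(particles: List[Particle]) -> None:
    """Advance every particle by one step (modular integer arithmetic)."""
    for p in particles:
        p.pos = (p.pos + p.vel) % RING


def reverse(particles: List[Particle]) -> None:
    """Exact inverse of forward(): undo one step without any saved state."""
    for p in particles:
        p.pos = (p.pos - p.vel) % RING


def run_with_checkpoint(particles: List[Particle], steps: int) -> List[Particle]:
    """Checkpointing: copy the full state first, then run forward."""
    saved = [Particle(p.pos, p.vel) for p in particles]  # memory cost grows with state size
    for _ in range(steps):
        forward(particles)
    return saved


def rollback_by_reversal(particles: List[Particle], steps: int) -> None:
    """Reverse computation: recover the earlier state by recomputing backwards."""
    for _ in range(steps):
        reverse(particles)
```

In this toy model the checkpoint costs one full copy of the state, whereas reversal costs extra computation proportional to the number of steps undone. Real simulations with collisions or floating-point state need more care to stay exactly reversible.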

Bibliographic details

Source: Cluster Computing, 2014, Issue 2, 11 pages
Authors: Kalyan S. Perumalla; Alfred J. Park
Original format: PDF
Language: English
Chinese Library Classification: Molecular biology
Keywords: Checkpointing; Rollback; Reverse computation; Performance evaluation; Parallel; Systems; Fault tolerance