Journal: Cluster Computing

Reverse computation for rollback-based fault tolerance in large parallel systems

Abstract

Reverse computation is presented here as an important future direction in addressing the challenge of fault-tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer from scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle (ideal gas) simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing, using measurements such as cache miss ratios, TLB misses and memory usage, indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.
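
The contrast between the two rollback strategies named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the class `Particles`, the helper functions, and the collision-free one-dimensional model with integer positions in a periodic box are all assumptions chosen so that each step is exactly invertible. Checkpointing restores a saved copy of the state, while reverse computation undoes the steps one by one and stores no copy.

```python
from copy import deepcopy


class Particles:
    """Hypothetical toy system: 1-D particles with integer state in a periodic box."""

    def __init__(self, positions, velocities, box):
        # Positions are assumed to lie in [0, box) so the modular update is invertible.
        self.x = list(positions)
        self.v = list(velocities)
        self.box = box
        self.steps = 0

    def forward(self):
        """Advance every particle by one time step."""
        for i in range(len(self.x)):
            self.x[i] = (self.x[i] + self.v[i]) % self.box
        self.steps += 1

    def reverse(self):
        """Undo exactly one forward step using only the current state."""
        for i in range(len(self.x)):
            self.x[i] = (self.x[i] - self.v[i]) % self.box
        self.steps -= 1


def rollback_by_checkpoint(checkpoint):
    """Checkpoint-based rollback: return a fresh copy of the saved state."""
    return deepcopy(checkpoint)


def rollback_by_reverse(system, n_steps):
    """Reverse-computation rollback: undo n_steps in place, with no saved copy."""
    for _ in range(n_steps):
        system.reverse()
    return system


if __name__ == "__main__":
    system = Particles(positions=[0, 3, 7], velocities=[2, -1, 5], box=10)
    checkpoint = deepcopy(system)      # memory cost grows with the state size
    for _ in range(4):
        system.forward()
    rollback_by_reverse(system, 4)     # no extra memory beyond the live state
    print(system.x == checkpoint.x)
```

Integer arithmetic is used here because floating-point updates are generally not exactly invertible; a real simulation with collisions would need additional care to make each step reversible.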
