Journal: Future Generation Computer Systems

Accelerating incremental checkpointing for extreme-scale computing


Abstract

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and increased efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
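The hashing half of the idea in the abstract can be illustrated with a small sketch. This is not libhashckpt's code or API. It is a CPU-only Python illustration that assumes a fixed-size state buffer, a 4096-byte page size, and SHA-256 as the hash; the function names `page_hashes`, `incremental_checkpoint`, and `restore` are hypothetical. The page-protection path and the GPU offload mentioned in the abstract are not shown.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative choice)


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size page of `data`."""
    return [
        hashlib.sha256(data[off:off + page_size]).digest()
        for off in range(0, len(data), page_size)
    ]


def incremental_checkpoint(data: bytes, prev_hashes: list,
                           page_size: int = PAGE_SIZE):
    """Return (changed_pages, new_hashes).

    `changed_pages` maps a page index to that page's bytes for every page
    whose digest differs from the previous checkpoint, or that had no
    previous digest at all (first checkpoint).
    """
    new_hashes = page_hashes(data, page_size)
    changed = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            changed[i] = data[i * page_size:(i + 1) * page_size]
    return changed, new_hashes


def restore(base: bytes, changed: dict, page_size: int = PAGE_SIZE) -> bytes:
    """Apply saved pages on top of `base` (assumes the state size is unchanged)."""
    buf = bytearray(base)
    for i, page in changed.items():
        buf[i * page_size:i * page_size + len(page)] = page
    return bytes(buf)


if __name__ == "__main__":
    state = bytearray(4 * PAGE_SIZE)
    _, hashes = incremental_checkpoint(bytes(state), [])  # first checkpoint: all pages
    state[5000] = 7                                       # touch one byte in page 1
    delta, hashes = incremental_checkpoint(bytes(state), hashes)
    print(sorted(delta))                                  # indices of pages to write
```

In this sketch only the pages whose digests changed are written at each checkpoint, which is the source of the savings over saving the full state every time; the cost moved onto the critical path is the hashing itself, which is the part the abstract says is offloaded to GPUs.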
