
Keeping Checkpointing Viable for Exascale Systems.

Abstract

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8 to 10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates two approaches: state-machine replication, to dramatically increase the checkpoint interval (the time between successive checkpoints), and hash-based, probabilistic incremental checkpointing using graphics processing units, to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
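As a rough illustration of the hash-based incremental checkpointing idea named in the abstract, the sketch below saves only the memory blocks whose digest changed since the previous checkpoint. It is not the thesis's implementation: it runs on the CPU with Python's hashlib rather than on a GPU, and the names BLOCK_SIZE, block_digests, and incremental_checkpoint are hypothetical, chosen for this example only. The approach is probabilistic in the sense that a modified block whose new digest happened to collide with its old one would go unsaved, an event that is extremely unlikely with a strong hash.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes, chosen for illustration


def block_digests(memory: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one SHA-256 digest per fixed-size block of `memory`."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory: bytes, previous_digests: list,
                           block_size: int = BLOCK_SIZE):
    """Return (changed_blocks, new_digests).

    changed_blocks maps a block index to that block's bytes, for every block
    that is new or whose digest differs from the previous checkpoint.
    """
    new_digests = block_digests(memory, block_size)
    changed = {}
    for idx, digest in enumerate(new_digests):
        if idx >= len(previous_digests) or previous_digests[idx] != digest:
            start = idx * block_size
            changed[idx] = memory[start:start + block_size]
    return changed, new_digests


# Demo: the first checkpoint saves everything, the second only the dirty block.
state = bytearray(4 * BLOCK_SIZE)
first, digests = incremental_checkpoint(bytes(state), [])
state[BLOCK_SIZE + 10] = 0xFF  # modify one byte inside block 1
second, digests = incremental_checkpoint(bytes(state), digests)
print(sorted(first), sorted(second))  # prints [0, 1, 2, 3] [1]
```

In this sketch the commit cost of the second checkpoint scales with the number of modified blocks rather than with total memory, which is the reduction in checkpoint commit time that the abstract refers to; the cost of hashing every block is the overhead that the thesis proposes to offload to graphics processing units.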

Bibliographic record

  • Author: Ferreira, Kurt B.
  • Author affiliation: The University of New Mexico
  • Degree-granting institution: The University of New Mexico
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pages: 180
  • Original format: PDF
  • Language: English
