Keeping Checkpointing Viable for Exascale Systems.

Abstract

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8 to 10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems because of their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates two approaches: state-machine replication, to dramatically increase the checkpoint interval (the time between successive checkpoints), and hash-based, probabilistic incremental checkpointing using graphics processing units, to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
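
The idea behind hash-based incremental checkpointing can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract describes computing hashes on graphics processing units, whereas this sketch runs on the CPU with Python's standard hashlib, and the block size, function names, and choice of SHA-256 are all assumptions made for illustration. Each checkpoint hashes fixed-size blocks of the memory image and saves only the blocks whose digest changed since the previous checkpoint. The scheme is probabilistic in the sense that a hash collision could, with very small probability, cause a modified block to be treated as unchanged.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes (an assumption)


def block_hashes(memory, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of the memory image."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory, previous_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    changed_blocks maps a block index to that block's bytes for every block
    whose digest differs from the previous checkpoint, or that did not exist
    in the previous checkpoint.
    """
    new_hashes = block_hashes(memory, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = memory[start:start + block_size]
    return changed, new_hashes


# Usage: the first checkpoint saves every block, the second only block 1.
state = bytearray(4 * BLOCK_SIZE)
first, hashes = incremental_checkpoint(bytes(state), [])
state[BLOCK_SIZE + 10] = 0xFF  # modify a single byte inside block 1
second, hashes = incremental_checkpoint(bytes(state), hashes)
print(len(first), sorted(second))  # 4 [1]
```

Because only the changed blocks are written, the volume of data committed per checkpoint shrinks when an application modifies a small fraction of its memory between checkpoints, which is the mechanism by which the commit time can fall.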

Bibliographic record

Author: Ferreira, Kurt B.
Author affiliation: The University of New Mexico
Degree-granting institution: The University of New Mexico
Subject: Computer Science
Degree: Ph.D.
Year: 2011
Pages: 180
Original format: PDF
Language: English
