Optimizing the fault-tolerance overheads of HPC systems using prediction and multiple proactive actions

Zhu Lei; Gu Jianhua; Wang Yunlan; Zhao Tianhai; Cai Zhennao

Abstract

The complexity and scale of high-performance computer systems are increasing rapidly, making fault tolerance a critical challenge. In this paper, we consider the impact of multiple proactive actions on proactive fault tolerance and periodic checkpointing. We extend Aupy's model to account for multiple proactive actions, including proactive checkpointing and task migration. We then propose optimal strategies for deciding when to trust predictions, and provide algorithms for computing the optimal interval between periodic checkpoints. The results show that the proposed method can significantly improve system productivity. Our case study indicates that the recall of the predictor matters more on small platforms, and that precision becomes increasingly important as the scale of the system grows.
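For readers unfamiliar with the quantities the abstract names, the following is a minimal background sketch. It is not the model or the algorithms of the paper. It shows the standard definitions of predictor precision and recall, and Young's classical first-order approximation of the periodic checkpoint interval, sqrt(2 * C * MTBF), which applies when no predictor is used. The function names, and the assumption that platform MTBF equals node MTBF divided by node count (independent, identical nodes), are illustrative choices and do not come from the source.

```python
import math


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of predicted faults that were real faults."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real faults that the predictor caught."""
    return true_pos / (true_pos + false_neg)


def platform_mtbf(node_mtbf: float, nodes: int) -> float:
    """Platform MTBF, assuming independent, identical nodes (illustrative assumption)."""
    return node_mtbf / nodes


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order checkpoint period without prediction: sqrt(2 * C * MTBF).

    This is the classical baseline, not the interval derived in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration (times in seconds).
    node_mtbf = 10.0 * 365 * 24 * 3600   # assumed per-node MTBF
    checkpoint_cost = 600.0              # assumed time to write one checkpoint
    for nodes in (1_000, 100_000):
        mu = platform_mtbf(node_mtbf, nodes)
        print(nodes, "nodes -> checkpoint every",
              round(young_interval(checkpoint_cost, mu)), "s")
```

Under these assumptions the platform MTBF shrinks as nodes are added, so the baseline checkpoint interval shrinks too. This is the overhead that prediction-based proactive actions aim to reduce.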

Bibliographic details

Source: Journal of Supercomputing, 2015, issue 10, pp. 3668-3694 (27 pages)
Authors: Zhu Lei; Gu Jianhua; Wang Yunlan; Zhao Tianhai; Cai Zhennao
Affiliation (all five authors): Northwestern Polytech Univ, Sch Comp, Xian 710072, Shaanxi, Peoples R China
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords: High-performance computer (HPC); Checkpoint/restart; Proactive fault tolerance; Resilience; Prediction
