The 39th International Conference on Parallel Processing

Optimizing HPC Fault-Tolerant Environment: An Analytical Approach


Abstract

The growing scale of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Application performance under failures, and how to optimize it, has become a pressing issue for the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, accounting for workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we tune three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) to conduct a comprehensive study of performance optimization under failures. We also study performance scalability under failures to explore the room for improvement offered by each parameter. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
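To make the trade-off between checkpointing interval, node count, and failure rate concrete, here is a minimal sketch. It is not the model proposed in the paper, whose equations are not given in this abstract. It uses Young's classic first-order approximation instead, and it assumes independent, exponentially distributed node failures, so that the system failure rate grows linearly with the number of nodes. All function and parameter names are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System mean time between failures.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, so failure rates add across nodes.
    """
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(interval_hours: float,
                      checkpoint_cost_hours: float,
                      mtbf_hours: float) -> float:
    """First-order estimate of the fraction of time lost.

    The first term is time spent writing checkpoints; the second is the
    expected rework after a failure (half an interval on average).
    Recovery cost is ignored in this simplified sketch.
    """
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


if __name__ == "__main__":
    mtbf = system_mtbf(node_mtbf_hours=10_000.0, num_nodes=1_000)
    tau = young_interval(checkpoint_cost_hours=0.05, mtbf_hours=mtbf)
    print(f"system MTBF: {mtbf:.2f} h")
    print(f"checkpoint interval: {tau:.2f} h")
    print(f"estimated overhead: {overhead_fraction(tau, 0.05, mtbf):.1%}")
```

With these illustrative inputs (a node MTBF of 10,000 hours, 1,000 nodes, and a 3-minute checkpoint), the system MTBF is 10 hours, the suggested interval is about 1 hour, and the estimated overhead is about 10%. Doubling the node count halves the system MTBF, which shortens the suggested interval and raises the overhead. This is the kind of scaling effect the abstract refers to.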
