Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

A ‘cool’ way of improving the reliability of HPC machines



Abstract

Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
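
The temperature rule quoted in the abstract (fault rates double for every 10°C rise in core temperature) can be turned into a small calculation. The sketch below is only an illustration of that rule, not the paper's validated model; the function names are hypothetical, and it assumes the doubling holds smoothly for fractional temperature changes.

```python
import math

# Assumption: fault rate scales as 2 ** (delta_T / 10), i.e. it doubles
# for every 10 degrees Celsius rise, as stated in the abstract.
DOUBLING_STEP_C = 10.0


def fault_rate_factor(delta_t_celsius: float) -> float:
    """Multiplicative change in fault rate for a core temperature change."""
    return 2.0 ** (delta_t_celsius / DOUBLING_STEP_C)


def mtbf_gain(cooling_celsius: float) -> float:
    """Factor by which mean time between failures grows when the core
    temperature is lowered by cooling_celsius (MTBF is taken as the
    inverse of the fault rate)."""
    return 1.0 / fault_rate_factor(-cooling_celsius)


def cooling_for_gain(gain: float) -> float:
    """Temperature reduction (in Celsius) needed for a given MTBF gain."""
    return DOUBLING_STEP_C * math.log2(gain)


if __name__ == "__main__":
    print(mtbf_gain(10.0))                    # a 10 degree drop halves the fault rate
    print(round(cooling_for_gain(2.3), 1))    # cooling implied by a 2.3x gain
```

Under this rule alone, a reliability improvement by a factor of 2.3 would correspond to lowering core temperatures by roughly 12°C. The abstract does not state the actual temperature reduction used in the experiments, so this figure is only what the doubling rule implies.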
