Conference: International Conference for High Performance Computing, Networking, Storage and Analysis

A ‘cool’ way of improving the reliability of HPC machines

Abstract

Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
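
The temperature rule quoted in the abstract (fault rates doubling for every 10°C rise in core temperature) can be turned into a small worked calculation. The sketch below is illustrative only and is not the authors' model: the function names are hypothetical, and it assumes the doubling rule holds exactly and uniformly across the temperature range of interest.

```python
import math

# Assumption taken from the abstract: the fault rate doubles for every
# 10 degrees Celsius rise in core temperature.
DOUBLING_STEP_C = 10.0


def fault_rate_multiplier(delta_t_celsius: float) -> float:
    """Relative change in fault rate for a temperature change of delta_t_celsius.

    A positive delta (hotter cores) gives a multiplier above 1;
    a negative delta (cooler cores) gives a multiplier below 1.
    """
    return 2.0 ** (delta_t_celsius / DOUBLING_STEP_C)


def cooling_for_improvement(factor: float) -> float:
    """Temperature drop in degrees Celsius that would cut the fault rate by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return DOUBLING_STEP_C * math.log2(factor)


if __name__ == "__main__":
    # How much cooler would cores need to run for a 2.3x reliability gain?
    drop = cooling_for_improvement(2.3)
    print(f"Required core temperature drop: {drop:.1f} degrees C")
    # Round trip: cooling by that amount scales the fault rate by 1/2.3.
    print(f"Fault rate multiplier: {fault_rate_multiplier(-drop):.3f}")
```

Under this simplified rule, the 2.3-fold reliability improvement reported in the abstract would correspond to running cores roughly 12°C cooler. The abstract does not state the temperature thresholds actually used, so this figure is only a consistency check on the rule, not a reported result.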
