Journal: IEEE Transactions on Dependable and Secure Computing

Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters

Abstract

Availability of the interconnection network in high-performance computing (HPC) systems is fundamental to sustaining the continuous execution of applications at scale. When failures occur, interconnect recovery mechanisms orchestrate complex operations to restore network connectivity between nodes. As the scale and design complexity of HPC systems increase, so does their susceptibility to failures during the execution of interconnect recovery procedures. This study characterizes the recovery procedures of the Gemini interconnect on Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications (NCSA) and the largest Gemini network built by Cray. We propose a propagation model that captures interconnect failures and recovery procedures, to help understand the types of failures and how they propagate through both the system and applications during recovery. The measurements show that recovery procedures occur very frequently, and that unsuccessful recovery, in which additional failures occur while a recovery procedure is executing, causes system-wide outages (SWOs; 28 out of 101) and application failures (3.4 percent of all running applications).