Fault Tolerance and Recovery in a High-Performance Computing (HPC) System

Abstract

In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system that includes multiple nodes. A fabric couples the nodes to each other and to storage that is accessible to each of the nodes and capable of storing multiple hosts, each of which is executable at any of the nodes. If a fault occurs at the currently running node, the method includes discontinuing operation of that node and booting the host from the storage at a free node in the HPC system.
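
The recovery step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Node`, `Cluster`, `recover`, and the `has_fault` callback are hypothetical, the shared storage is modeled as an in-memory dictionary, and "booting" is reduced to assigning the host name to a free node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    host: Optional[str] = None  # host currently booted on this node, if any
    running: bool = False
    faulted: bool = False       # set once the node has been taken out of service


@dataclass
class Cluster:
    nodes: List[Node]
    # Stand-in for the shared storage reachable over the fabric:
    # host name -> boot image (hypothetical representation).
    storage: Dict[str, bytes] = field(default_factory=dict)

    def free_node(self) -> Optional[Node]:
        """Return a healthy node that is not running any host, or None."""
        for node in self.nodes:
            if not node.running and node.host is None and not node.faulted:
                return node
        return None


def recover(cluster: Cluster, node: Node,
            has_fault: Callable[[Node], bool]) -> Optional[Node]:
    """Check one running node; on a fault, stop it and reboot its host elsewhere.

    Returns the node that now runs the host, or None if no fault was detected.
    """
    if not node.running or not has_fault(node):
        return None

    # Discontinue operation of the faulty node.
    host = node.host
    node.running = False
    node.host = None
    node.faulted = True

    # Boot the same host from shared storage at a free node.
    target = cluster.free_node()
    if target is None or host not in cluster.storage:
        raise RuntimeError(f"cannot recover host {host!r}: no free node or image")
    target.host = host
    target.running = True
    return target
```

In this sketch, a monitoring loop would call `recover` for each running node with whatever fault detector the system provides. Because every host image lives in storage that all nodes can reach, any free node can take over, which is the property the abstract relies on.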
