Fault tolerance and recovery in high performance computing (HPC) systems
Abstract
In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system that includes multiple nodes. A fabric couples the nodes to each other and to storage that is accessible to each node and capable of storing multiple hosts, each executable at any of the nodes. If a fault occurs at the currently running node, the method includes discontinuing operation of that node and booting the host from the storage at a free node in the HPC system.
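The recovery flow described in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the `Node` and `Cluster` classes, the `recover` method, and the dictionary standing in for fabric-attached shared storage are all hypothetical names introduced here, and fault detection is reduced to a boolean flag that some external monitor is assumed to set.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical model of one HPC node."""
    name: str
    healthy: bool = True          # assumed to be cleared by an external monitor on a fault
    host: Optional[str] = None    # name of the host currently running here, if any


@dataclass
class Cluster:
    """Hypothetical model: nodes plus shared storage reachable from every node."""
    nodes: Dict[str, Node]
    # Stand-in for the shared storage: host name -> bootable image.
    storage: Dict[str, bytes] = field(default_factory=dict)

    def find_free_node(self) -> Optional[Node]:
        """Return a healthy node that is not running any host, or None."""
        for node in self.nodes.values():
            if node.healthy and node.host is None:
                return node
        return None

    def recover(self, node_name: str) -> Optional[Node]:
        """If the named node has faulted, stop it and boot its host elsewhere.

        Returns the node that now runs the host, or None if no recovery
        was needed (the node is healthy or was idle).
        """
        failed = self.nodes[node_name]
        if failed.healthy or failed.host is None:
            return None

        host = failed.host
        if host not in self.storage:
            raise KeyError(f"host {host!r} is not present in shared storage")

        # Step 1: discontinue operation of the faulted node.
        failed.host = None

        # Step 2: boot the same host from shared storage at a free node.
        free = self.find_free_node()
        if free is None:
            raise RuntimeError(f"no free node available for host {host!r}")
        free.host = host
        return free


# Usage: node n1 runs host "hostA", n2 is idle; n1 then faults.
cluster = Cluster(
    nodes={"n1": Node("n1", host="hostA"), "n2": Node("n2")},
    storage={"hostA": b"boot-image"},
)
cluster.nodes["n1"].healthy = False
replacement = cluster.recover("n1")
```

Because the host image lives in storage that every node can reach, the sketch only has to reassign the host name to move it. In a real system the boot step would load the image over the fabric instead of updating a field.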