Fault Tolerance and Recovery in a High-Performance Computing (HPC) System
Abstract
In one embodiment, a method for fault tolerance and recovery in a high-performance computing (HPC) system includes monitoring a currently running node in an HPC system that includes multiple nodes. A fabric couples the multiple nodes to each other and to storage that is accessible to each of the nodes and capable of storing multiple hosts, each of which is executable at any of the nodes. If a fault occurs at the currently running node, the method includes discontinuing operation of that node and booting its host at a free node in the HPC system from the storage.
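The recovery step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the `Node` and `Cluster` classes, the `faulted` flag standing in for the monitoring result, and the dictionary standing in for fabric-attached shared storage are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    name: str
    running: bool = True          # False once operation is discontinued
    host: Optional[str] = None    # host currently booted on this node, if any
    faulted: bool = False         # assumed to be set by some monitoring component


@dataclass
class Cluster:
    nodes: Dict[str, Node]
    storage: Dict[str, bytes]     # stand-in for shared storage: host name -> boot image

    def free_node(self) -> Optional[Node]:
        """Return a healthy, running node that has no host booted on it."""
        for node in self.nodes.values():
            if node.running and not node.faulted and node.host is None:
                return node
        return None

    def recover(self, name: str) -> Optional[Node]:
        """If the named node has faulted, stop it and reboot its host elsewhere.

        Returns the node that now runs the host, or None if nothing was moved.
        """
        node = self.nodes[name]
        if not node.faulted:
            return None           # no fault detected, nothing to do

        host = node.host
        node.running = False      # discontinue operation of the faulted node
        node.host = None

        if host is None or host not in self.storage:
            return None           # no host image available to boot
        target = self.free_node()
        if target is None:
            return None           # no free node available in this sketch
        target.host = host        # "boot" the host from shared storage
        return target


# Example: host-a runs on n0, n1 is free; n0 faults and host-a moves to n1.
cluster = Cluster(
    nodes={"n0": Node("n0", host="host-a"), "n1": Node("n1")},
    storage={"host-a": b"boot-image"},
)
cluster.nodes["n0"].faulted = True
moved_to = cluster.recover("n0")
```

Because every host image lives in storage reachable from every node, the sketch only has to change which node the host is assigned to; no state is copied from the faulted node.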