This paper describes the Grid Enactor and Management Service (GEMS), a system supporting submission, monitoring, and restart of Grid jobs. GEMS supports the detection of individual job process failures for parallel message-passing applications. Failed jobs can be canceled and restarted, either on the same local resource if sufficient nodes are available in a restart queue, or on another resource. GEMS requires that a local resource manager support certain fault-detection and reporting capabilities. These capabilities are implemented in DQ, a prototype cluster scheduler.