The present invention relates to a method and apparatus for ensuring fault detection and system recovery in a multiprocessor computing system. This system comprises a multitude of processing element modules, input/output processor modules and shared memory modules. Each module within the system includes an identical period sanity timer capable to reset the module once a predetermined limit count is reached. If a global clear signal is not received from the operating system scheduler by all modules prior to the expiry of the sanity timers, a system-wide reset is effected. Each processing element module within the system further includes a watchdog timer capable to reset the module once a predetermined limit count is reached. If a process is not run by the operating system scheduler on the processing element before the expiry of the watchdog timer, effectively clearing the watchdog timer, the processing element is reset and removed from service.
展开▼