Certain types of system faults, notably data errors due to transient faults, can be repaired by software. The repair consists of identifying faulty variables and then rewriting data to correct the fault. If fault identification is imprecise, repair procedures can contaminate non-faulty processes with data originating at faulty processes. This danger of contamination is resolved by delaying data correction for a sufficiently long period. To delay correction, processes use a repair timer. This paper considers the problem of how asynchronous processes can implement a repair timer that is itself subject to faults. The main results are requirement specifications for a distributed repair timer and a repair timer algorithm. The algorithm self-stabilizes in O(D) rounds, where D is the diameter of the network, and provides reliable timing from k-faulty configurations within O(k) rounds.