The problem of recovering distributed systems from crash failures has been widely studied in the context of traditional non-threaded processes. However, extending those solutions to the multi-threaded scenario presents new problems. We identify and address these problems for optimistic logging protocols. There are two natural extensions of optimistic logging protocols to the multi-threaded scenario. The first extension is process-centric, where the points of internal non-determinism caused by threads are logged. The second extension is thread-centric, where each thread is treated as a separate process. The process-centric approach suffers from false causality, while the thread-centric approach suffers from high causality tracking overhead. By observing that the granularity of failures can be different from the granularity of rollbacks, we design a new balanced approach that incurs low causality tracking overhead and also eliminates false causality.