Large-scale distributed systems are very attractive for the execution of parallel applications that require large amounts of computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single-node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
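
The idea of exploiting existing replication to limit page transfers can be illustrated with a minimal sketch. It is not the paper's protocol: it assumes, for illustration only, that a checkpoint is safe against a single node failure once every page has copies on at least two distinct nodes, so that pages already replicated by the DSM need no transfer. The function name `plan_checkpoint`, the page-to-holders map, and the choice of target node are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_checkpoint(
    copies: Dict[int, Set[int]], nodes: List[int]
) -> List[Tuple[int, int, int]]:
    """Return (page, source_node, target_node) transfers needed so that
    every page ends up with copies on two distinct nodes.

    Assumption (not from the paper): two copies on distinct nodes are
    enough to survive a single node failure.
    """
    if len(nodes) < 2:
        raise ValueError("at least two nodes are needed to tolerate a failure")

    transfers: List[Tuple[int, int, int]] = []
    for page, holders in sorted(copies.items()):
        if not holders:
            raise ValueError(f"page {page} has no copy anywhere")
        if len(holders) >= 2:
            # Already replicated by the DSM: no transfer is required.
            continue
        source = next(iter(holders))
        # Hypothetical placement policy: first node that is not the source.
        target = next(n for n in nodes if n != source)
        transfers.append((page, source, target))
    return transfers


# Page 0 is already held by two nodes, so only pages 1 and 2 are transferred.
example = plan_checkpoint({0: {0, 1}, 1: {2}, 2: {0}}, nodes=[0, 1, 2])
```

In this toy setting, the more pages the coherence protocol has already replicated for ordinary sharing, the shorter the transfer list at checkpoint time.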