Providing fault tolerance for parallel and distributed applications is a problem of paramount importance: the overall failure rate of the system increases with the number of processors, and the failure of just one processor can lead to the complete crash of the program. Checkpointing mechanisms are a good candidate for ensuring that applications can continue when failures occur. In this paper, we present an experimental study of several variations of checkpointing for SPMD (single program, multiple data) applications. We used a typical benchmark to experimentally assess the overhead, advantages, and limitations of each checkpointing scheme.