Scientific workflows are data and compute intensive thus may run for days or even for weeks on parallel and distributed infrastructures such as HPC systems and cloud. In HPC environment the number of failures that can arise during scientific workflow enactment can be high so the use of fault tolerance techniques is unavoidable. The most frequently used fault tolerance techniques are job replication and checkpointing. While job replication is based on the assumption that the probability of single failures is much higher than of simultaneous failures, the checkpointing saves certain states and the execution can be restarted from that point later on. The effectiveness of the checkpointing method depends on the checkpointing interval. Common technique is to dynamically adapt the checkpointing interval. In this work we give a brief overview of the different checkpointing techniques and propose a new provenance based dynamic checkpointing method.
展开▼