Cloud systems, like any other system, must be reliable. This means that the system should respond correctly in the presence of failures, which are quite probable in a distributed, largely independent system such as a cloud. It is therefore important that cloud systems be fault tolerant, ensuring safe recovery from failures. Failures in clouds may come from several different sources, with communication failures playing a major role, so the techniques that can be applied to ensure reliability also vary widely. This survey presents a systematic review of solutions for providing fault tolerance in open source clouds. Our goal is to give cloud managers a guided approach to choosing a solution for a given problem or system.