Modern computer devices exhibit transient hardware faults that disturb their electrical behavior but do not cause permanent physical damage. Transient faults have many sources, such as fluctuations of the supply voltage, electromagnetic interference, and radiation from the natural environment. Dependable computer systems must therefore incorporate fault-tolerance methods to cope with transient faults. Software-implemented fault tolerance is a promising approach that reduces the probability of failure to an acceptable level without expensive hardware redundancy.

This thesis focuses on software-implemented fault tolerance for operating systems because they are the most critical pieces of software in a computer system: all computer programs depend on the integrity of the operating system. However, the C/C++ source code of common operating systems already tends to be exceedingly complex, so manually extending it with fault tolerance is not a viable solution. This thesis therefore proposes a generic solution based on Aspect-Oriented Programming (AOP).

To evaluate AOP as a means to improve the dependability of operating systems, this thesis presents the design and implementation of a library of aspect-oriented fault-tolerance mechanisms. These mechanisms constitute separate program modules that can be integrated automatically into common off-the-shelf operating systems using a compiler for the AOP language. The aspect-oriented approach thus makes it possible to improve the dependability of large-scale software systems without affecting the maintainability of the source code. The library allows choosing between several error-detection and error-correction schemes, and provides wait-free synchronization for handling asynchronous and multi-threaded operating-system code.

This thesis evaluates the aspect-oriented approach to fault tolerance on the basis of two off-the-shelf operating systems. The evaluation also covers the protection of one user-level program, because the library of fault-tolerance mechanisms is highly generic and transparent and thus not limited to operating systems. Exhaustive fault-injection experiments show an excellent trade-off between runtime overhead and fault tolerance, which can be adjusted and optimized by fine-grained selective placement of the fault-tolerance mechanisms. Finally, this thesis provides evidence for the effectiveness of the approach in detecting and correcting radiation-induced hardware faults: high-energy particle radiation experiments confirm improvements in fault tolerance by almost 80 percent.