Journal: Parallel Computing

Towards an immortal operating system in virtual environments


Abstract

Many OS crashes are caused by bugs in kernel extensions or device drivers, even when the OS itself has been tested rigorously. To make an OS immortal, we must resurrect it from these crashes. We present a novel OS-hypervisor infrastructure that allows automated and transparent OS crash diagnosis and recovery in a virtual environment. This infrastructure eliminates the need for reboots or checkpoint-restart mechanisms, which require preserving the state of critical applications before the crash happens and also require extensive modifications to those applications. At the core of our approach is a small, hidden OS repair image that is dynamically created from the healthy running OS instance. When the OS crashes, the hypervisor dynamically loads this repair image to perform diagnosis and repair. One repair method we have experimented with is to quarantine the offending process and automatically resume the repaired OS without a reboot. Experimental evaluations demonstrated that it takes less than 3 s to recover from an OS crash. This approach can significantly reduce downtime and maintenance costs in data centers. It is the first design and implementation of an OS-hypervisor combination capable of automatically resurrecting a crashed commercial server OS. In addition to online diagnosis and recovery, this infrastructure can be used for offline diagnosis and can be incorporated into the technical support tools of the OS vendor. We have also used parts of this infrastructure to speed up the diagnosis of AIX OS crashes for the IBM technical support teams.
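The recovery flow described in the abstract (create a repair image while the OS is healthy, load it when a crash occurs, diagnose, quarantine the offending process, resume without a reboot) can be illustrated with a toy simulation. This is only a sketch of the control flow under stated assumptions, not the authors' implementation: every class, method, and field name below (GuestOS, RepairImage, Hypervisor, on_guest_crash, and so on) is hypothetical, and the real diagnosis and repair steps operate on kernel state rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class GuestOS:
    """Toy model of a guest OS instance (all fields are hypothetical)."""
    processes: Set[str] = field(default_factory=set)
    quarantined: Set[str] = field(default_factory=set)
    crashed: bool = False
    faulting_process: Optional[str] = None

    def crash(self, culprit: str) -> None:
        # Simulate a crash attributed to one process, e.g. through a buggy driver.
        self.crashed = True
        self.faulting_process = culprit


class RepairImage:
    """Stand-in for the hidden repair image built from the healthy OS."""

    def diagnose(self, guest: GuestOS) -> Optional[str]:
        # Assumption: the culprit is already recorded; real diagnosis is far harder.
        return guest.faulting_process

    def repair(self, guest: GuestOS, culprit: str) -> None:
        # Quarantine the offending process and clear the crash condition.
        guest.processes.discard(culprit)
        guest.quarantined.add(culprit)
        guest.crashed = False
        guest.faulting_process = None


class Hypervisor:
    """Toy hypervisor that owns the repair image and reacts to guest crashes."""

    def __init__(self, guest: GuestOS) -> None:
        self.guest = guest
        self.repair_image: Optional[RepairImage] = None

    def snapshot_repair_image(self) -> None:
        # The repair image must be created while the guest is still healthy.
        if self.guest.crashed:
            raise RuntimeError("cannot create a repair image from a crashed guest")
        self.repair_image = RepairImage()

    def on_guest_crash(self) -> bool:
        """Return True if the guest was repaired and resumed without a reboot."""
        if self.repair_image is None or not self.guest.crashed:
            return False
        culprit = self.repair_image.diagnose(self.guest)
        if culprit is None:
            return False  # diagnosis failed; a reboot would be the fallback
        self.repair_image.repair(self.guest, culprit)
        return True


if __name__ == "__main__":
    guest = GuestOS(processes={"db", "web", "faulty_app"})
    hv = Hypervisor(guest)
    hv.snapshot_repair_image()
    guest.crash("faulty_app")
    print("recovered:", hv.on_guest_crash())
    print("still running:", sorted(guest.processes))
```

The sketch keeps the repair image inside the hypervisor object to mirror the abstract's point that the image is hidden from the running OS and loaded only on a crash; how the actual system stores and protects the image is not specified in the abstract.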
