Detecting and Recovering from In-Core Hardware Faults Through Software Anomaly Treatment

Abstract

Aggressive scaling of CMOS transistors has enabled extensive system integration and the building of faster and more efficient systems. On the flip side, this has resulted in an increasing number of devices that fail in shipped components in the field for a variety of reasons, including soft errors, wear-out failures, and infant mortality. The pervasiveness of the problem across a broad market demands low-cost and generic reliability solutions, precluding traditional solutions that employed excessive redundancy or piecemeal solutions that address only a few failure modes. This dissertation presents SWAT (SoftWare Anomaly Treatment), a low-cost resiliency solution that effectively handles hardware faults while incurring low cost during the common mode of fault-free operation. SWAT is based on two key observations about the design of resilient systems. First, only those hardware faults that affect software need to be handled; second, since the common mode of operation is fault-free, fault-free execution should incur near-zero overheads. SWAT thus uses novel zero to low cost hardware and software monitors that watch for anomalous software behavior to detect hardware faults. SWAT then relies on hardware support for checkpointing and rollback recovery. When dealing with fault recovery in the presence of I/O, we identify that existing software-level mechanisms that handle output buffering fall short. This dissertation therefore proposes a simple low-cost hardware buffer for output buffering and demonstrates that this strategy achieves high recoverability while incurring low overheads. Although not detailed in this dissertation, SWAT contains a comprehensive diagnosis procedure that is invoked in the rare event of a fault to isolate the root cause of the fault by distinguishing between software bugs, transient hardware faults, and permanent hardware faults. Effectively, SWAT handles hardware faults uniformly as software bugs, amortizing the resiliency cost across both hardware and software reliability.

The results in this dissertation show that the SWAT strategy is effective at detecting, and recovering the system from, a variety of in-core permanent and transient faults in various microarchitecture units for both compute-intensive and I/O-intensive workloads. In particular, this dissertation demonstrates that the SWAT detectors detect nearly all permanent and transient faults in most hardware units in both types of workloads, with only a small fraction of the faults corrupting application output. (Certain hardware structures like the FPU may need additional support to be amenable to software anomaly detection.) Further, a majority of these faults are tolerated by the applications due to their inherent fault-tolerant nature, resulting in only 0.2% of the injected faults affecting the application and yielding incorrect outputs (such faults are classified as Silent Data Corruptions, or SDCs). When attempting to recover from the detected faults, we show that handling I/O is important for fault recovery. With our proposed low-cost hardware for output buffering, we show that over 94% of the detected faults are recoverable with low performance and area overheads during fault-free execution, even in the presence of I/O.

Finally, this dissertation builds a fundamental understanding of why the SWAT strategy is effective for handling faults in modern workloads. The key insight is that the SWAT detectors are adept at detecting perturbations in control operations and memory addresses, and a majority of the application values affect such operations. Faults in values that never affect such operations are hard to detect and require additional support to be amenable to software anomaly detection.

In summary, this dissertation presents SWAT as a complete solution to detect and recover from in-core hardware faults. The techniques presented here therefore have far-reaching implications for the design of low-cost solutions to handle unreliable hardware.
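The interplay of detection, checkpointing, rollback, and output buffering described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the dissertation's design: SWAT relies on hardware support for checkpointing and a hardware output buffer, whereas the code below mimics the idea in plain Python. The names `AnomalyDetected`, `BufferedRecoverableRun`, and `run_interval` are hypothetical and do not come from the source.

```python
import copy


class AnomalyDetected(Exception):
    """Raised by a (hypothetical) monitor that sees anomalous software behavior."""


class BufferedRecoverableRun:
    """Toy model: outputs are held back until the next checkpoint commits them."""

    def __init__(self, state, commit_output):
        self.state = state
        self._checkpoint = copy.deepcopy(state)  # last known-good state
        self._pending = []                       # outputs since the last checkpoint
        self._commit_output = commit_output      # callback that performs real I/O

    def emit(self, item):
        # Buffer the output instead of releasing it immediately.
        self._pending.append(item)

    def checkpoint(self):
        # No anomaly was seen in this interval: release outputs, save state.
        for item in self._pending:
            self._commit_output(item)
        self._pending.clear()
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # An anomaly was seen: discard unreleased outputs, restore state.
        self._pending.clear()
        self.state = copy.deepcopy(self._checkpoint)


def run_interval(run, step, retries=1):
    """Execute one interval; on an anomaly, roll back and re-execute."""
    for _ in range(retries + 1):
        try:
            step(run)
            run.checkpoint()
            return True
        except AnomalyDetected:
            run.rollback()
    return False  # fault persisted across retries (e.g. a permanent fault)
```

Because outputs leave the buffer only at a checkpoint, re-executing an interval after a rollback cannot release the same output twice, which is the property that makes recovery safe in the presence of I/O in this toy model.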

Bibliographic details

  • Author

    Ramachandran Pradeep

  • Year: 2011
  • Original format: PDF
  • Language: English
