Detecting and Recovering from In-Core Hardware Faults Through Software Anomaly Treatment

Abstract

Aggressive scaling of CMOS transistors has enabled extensive system integration and the building of faster and more efficient systems. On the flip side, this has resulted in an increasing number of devices that fail in shipped components in the field for a variety of reasons including soft errors, wear-out failures, and infant mortality. The pervasiveness of the problem across a broad market demands low-cost and generic reliability solutions, precluding traditional solutions that employed excessive redundancy or piecemeal solutions that address only a few failure modes. This dissertation presents SWAT (SoftWare Anomaly Treatment), a low-cost resiliency solution that effectively handles hardware faults while incurring low cost during the common mode of fault-free operation. SWAT is based on two key observations about the design of resilient systems. First, only those hardware faults that affect software need to be handled. Second, since the common mode of operation is fault-free, fault-free execution should incur near-zero overheads. SWAT thus uses novel zero to low-cost hardware and software monitors that watch for anomalous software behavior to detect hardware faults. SWAT then relies on hardware support for checkpointing and rollback recovery. When dealing with fault recovery in the presence of I/O, we identify that existing software-level mechanisms that handle output buffering fall short. This dissertation therefore proposes a simple low-cost hardware buffer for output buffering and demonstrates that this strategy achieves high recoverability while incurring low overheads. Although not detailed in this dissertation, SWAT contains a comprehensive diagnosis procedure that is invoked in the rare event of a fault to isolate the root cause of the fault by distinguishing between software bugs, transient hardware faults, and permanent hardware faults. Effectively, SWAT handles hardware faults uniformly as software bugs, amortizing the resiliency cost across both hardware and software reliability.

The results in this dissertation show that the SWAT strategy is effective at detecting and recovering the system from a variety of in-core permanent and transient faults in various microarchitecture units for both compute-intensive and I/O-intensive workloads. In particular, this dissertation demonstrates that the SWAT detectors detect nearly all permanent and transient faults in most hardware units in both types of workloads, with only a small fraction of the faults corrupting application output. (Certain hardware structures like the FPU may need additional support to be amenable to software anomaly detection.) Further, a majority of these faults are tolerated by the applications due to their inherent fault-tolerant nature, resulting in only 0.2% of the injected faults affecting the application and yielding incorrect outputs (such faults are classified as Silent Data Corruptions, or SDCs). When attempting to recover from the detected faults, we show that handling I/O is important for fault recovery. With our proposed low-cost hardware for output buffering, we show that over 94% of the detected faults are recoverable with low performance and area overheads during fault-free execution, even in the presence of I/O.

Finally, this dissertation builds a fundamental understanding of why the SWAT strategy is effective for handling faults in modern workloads. The key insight is that the SWAT detectors are adept at detecting perturbations in control operations and memory addresses, and a majority of the application values affect such operations. Faults in values that never affect such operations are hard to detect and require additional support to be amenable to software anomaly detection.

In summary, this dissertation presents SWAT as a complete solution to detect and recover from in-core hardware faults. The techniques presented here therefore have far-reaching implications for the design of low-cost solutions to handle unreliable hardware.
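The interaction between checkpointing, rollback, and output buffering described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the dissertation's hardware design: the class `BufferedSystem`, its method names, and the policy of releasing buffered outputs at every checkpoint are all hypothetical choices made for illustration, since the abstract does not state the commit policy.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedSystem:
    """Toy model (hypothetical): state plus an output buffer released only at checkpoints."""
    state: int = 0
    checkpoint: int = 0                              # last known-good state
    pending: list = field(default_factory=list)      # outputs not yet externally visible
    committed: list = field(default_factory=list)    # outputs released to the outside world

    def step(self, delta: int) -> None:
        """Do some work and produce an output, which is held in the buffer."""
        self.state += delta
        self.pending.append(self.state)

    def take_checkpoint(self) -> None:
        """Assumed policy: no anomaly seen, so release buffered outputs and save state."""
        self.committed.extend(self.pending)
        self.pending.clear()
        self.checkpoint = self.state

    def rollback(self) -> None:
        """Anomaly detected: restore the checkpoint and drop unreleased outputs."""
        self.state = self.checkpoint
        self.pending.clear()


def run_demo() -> list:
    system = BufferedSystem()
    system.step(1)
    system.step(2)
    system.take_checkpoint()   # outputs 1 and 3 become visible
    system.step(100)           # pretend a hardware fault corrupted this step
    system.rollback()          # a detector fired: discard the corrupt output
    system.step(4)             # re-execute the interval fault-free
    system.take_checkpoint()
    return system.committed


print(run_demo())  # [1, 3, 7]
```

The point of the sketch is that the corrupted output (103) never leaves the buffer, so rolling back the state is enough to recover; without buffering, the output would already be externally visible and could not be undone.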

Bibliographic details

Author: Ramachandran, Pradeep
Year: 2011
Original format: PDF
Language: English
