Cost-Effective Lock-stepped EDDI: A Transient Fault Detection Mechanism for Processors

王超; 傅忠传; 陈红松; 崔刚

Abstract

As semiconductor technology scales into the deep-submicron regime, the transient fault vulnerability of combinational logic in processors increases rapidly and will overtake that of sequential logic. This paper proposes an instruction-level transient fault detection mechanism, Cost-Effective Lock-stepped EDDI (Error Detection by Duplicated Instructions), which covers both the combinational and the sequential logic components of a processor and targets in particular the hard-to-protect combinational logic. From the initial design stage, the error detection and correction capability of the hardware registers and the fault detection capability of the system software are both taken into account. The primary contributions are as follows. First, a quantitative method based on probability theory is proposed for estimating the SDC (Silent Data Corruption) rate, to guide the tradeoff between fault tolerance and performance. Earlier application-level detection mechanisms ignored the fault detection capability of the underlying operating system; the resulting underestimation of system reliability incurs a large performance overhead in the fault detection algorithm. Here, faults are classified into categories according to the processor components an instruction passes through, and the operating system is included in the estimate. Fault injection experiments show that the theoretical estimates fit the measured results well. Second, the Cost-Effective Lock-stepped EDDI mechanism is designed, together with a novel instruction duplication rule and a register allocation scheme. Coupled with EDAC (Error Detection And Correction) hardware for sequential logic, Lock-stepped EDDI, which is especially effective for combinational logic, greatly reduces performance overhead. The register allocation scheme in the compiler front end sharply reduces the number of reserved registers, which relieves register pressure and significantly reduces the number of load/store instructions. Because registers are reserved, there is no need to modify the function parameter passing rules or to recompile system libraries, which matters greatly for the generality of the mechanism. Third, a single-bit fault model is adopted, and fault injection campaigns are conducted on representative components of a SPARC processor: the decoder unit, the address generation unit, and the ALU. Compared with the traditional fully replicated EDDI, Cost-Effective Lock-stepped EDDI achieves an average 35.2% reduction in execution time and an average 36.1% reduction in the number of dynamic instructions, at the modest cost of an average 0.8% increase in SDC rate.
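
The duplicate-and-compare idea that EDDI-style schemes rely on, together with the single-bit fault model, can be pictured with a small sketch. The following is only an illustration under stated assumptions: the names `alu_add`, `checked_add`, and `FaultDetected` are hypothetical, the 32-bit width is assumed, and the code does not reproduce the paper's instruction duplication rule or register allocation scheme, which the abstract does not spell out.

```python
class FaultDetected(Exception):
    """Raised when the master and shadow results disagree (hypothetical name)."""


def alu_add(a, b, fault_bit=None):
    """Toy 32-bit ALU add.

    If fault_bit is given, one bit of the result is flipped to model a
    single-bit transient fault in combinational logic (assumed fault model).
    """
    result = (a + b) & 0xFFFFFFFF
    if fault_bit is not None:
        result ^= 1 << fault_bit
    return result


def checked_add(a, b, fault_bit=None):
    """Run the add twice (master and shadow copies) and compare the results.

    The optional fault is injected into the master copy only, so a
    mismatch between the two copies exposes it before the value is used.
    """
    master = alu_add(a, b, fault_bit=fault_bit)
    shadow = alu_add(a, b)  # duplicated instruction on shadow operands
    if master != shadow:
        raise FaultDetected(f"master={master:#x} shadow={shadow:#x}")
    return master


if __name__ == "__main__":
    print(checked_add(2, 3))  # fault-free run
    try:
        checked_add(2, 3, fault_bit=4)  # inject a single-bit flip
    except FaultDetected as err:
        print("detected:", err)
```

In this toy model every duplicated operation is compared immediately; a real scheme would choose which instructions to duplicate and where to place comparisons in order to trade detection coverage against overhead, which is the tradeoff the abstract quantifies.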
