Published in: Computer architecture news

GangES: Gang Error Simulation for Hardware Resiliency Evaluation

Abstract

As technology scales, the hardware reliability challenge affects a broad computing market, rendering traditional redundancy-based solutions too expensive. Software-anomaly-based hardware error detection has emerged as a low-cost reliability solution, but it suffers from Silent Data Corruptions (SDCs). To develop software-centric, low-cost hardware resiliency solutions, it is crucial to evaluate SDC rates accurately and to identify the software locations that produce SDCs. A recent tool called Relyzer systematically analyzes an entire application's resiliency to single-bit soft errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations.

This paper presents a new technique called GangES (Gang Error Simulator) that aims to reduce error simulation time. GangES observes that a set, or gang, of error simulations that reach the same intermediate execution state (after their error injections) will produce the same error outcome. Only one simulation of the gang therefore needs to be completed, which yields significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For our workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%.

This paper also explores techniques based purely on program analysis that could obviate the need for tools such as GangES altogether. The availability of Relyzer+GangES allows us to perform a detailed evaluation of such techniques. We evaluate the accuracy of several previously proposed program metrics. We find that the metrics we considered, and their various linear combinations, are unable to adequately predict an instruction's vulnerability to SDCs. This further motivates Relyzer+GangES-style techniques as valuable solutions to the hardware error resiliency evaluation problem.
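
The gang observation in the abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the GangES implementation: the two-register toy machine, the single fixed `checkpoint` step at which states are compared, the helper names (`step`, `inject`, `gang_simulate`, `brute_force`), and the reduction of outcomes to just "masked" and "SDC". The idea it demonstrates is only this: because execution is deterministic, two injected runs that reach an identical state at the same point must end identically, so the second run can reuse the first run's outcome instead of being simulated to completion.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]      # (acc, x): registers of a hypothetical toy machine
Injection = Tuple[int, int]  # (step before which to inject, bit of x to flip)


def step(state: State) -> State:
    """One deterministic step: add the low bit of x to acc, then shift x right."""
    acc, x = state
    return ((acc + (x & 1)) & 0xFF, x >> 1)


def run(state: State, start: int, stop: int) -> State:
    """Advance the state from step index `start` up to, but excluding, `stop`."""
    for _ in range(start, stop):
        state = step(state)
    return state


def inject(initial: State, inj: Injection) -> State:
    """Run error-free up to the injection step, then flip one bit of x."""
    t, bit = inj
    acc, x = run(initial, 0, t)
    return (acc, x ^ (1 << bit))


def classify(final: State, golden: State) -> str:
    """The toy program's output is acc; a differing output counts as an SDC."""
    return "masked" if final[0] == golden[0] else "SDC"


def gang_simulate(initial: State, injections: List[Injection],
                  checkpoint: int, total_steps: int) -> Tuple[List[str], int]:
    """Simulate each injection to the checkpoint; finish only one run per distinct state."""
    golden = run(initial, 0, total_steps)
    outcome_by_state: Dict[State, str] = {}
    outcomes: List[str] = []
    completed = 0
    for inj in injections:
        # Assumption of this sketch: every injection happens before the checkpoint.
        assert inj[0] <= checkpoint <= total_steps
        mid = run(inject(initial, inj), inj[0], checkpoint)
        if mid not in outcome_by_state:
            # First member of its gang: complete the simulation and record the outcome.
            final = run(mid, checkpoint, total_steps)
            outcome_by_state[mid] = classify(final, golden)
            completed += 1
        outcomes.append(outcome_by_state[mid])
    return outcomes, completed


def brute_force(initial: State, injections: List[Injection],
                total_steps: int) -> List[str]:
    """Reference: run every injected simulation all the way to the end."""
    golden = run(initial, 0, total_steps)
    return [classify(run(inject(initial, inj), inj[0], total_steps), golden)
            for inj in injections]


initial = (0, 0b00010010)
injections = [(t, bit) for t in range(4) for bit in range(8)]
outcomes, completed = gang_simulate(initial, injections, checkpoint=4, total_steps=6)
print(f"{completed} of {len(injections)} simulations were run to completion")
```

In this toy setting the gang-based outcomes match the brute-force outcomes while fewer runs are carried to the end. The sketch compares the full state at one fixed step, whereas the abstract says GangES uses program structure to choose when to compare and what state to compare, so the real savings and overhead depend on choices the sketch does not model.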
