ACM/IEEE International Symposium on Computer Architecture

GangES: Gang Error Simulation for Hardware Resiliency Evaluation

Abstract

As technology scales, the hardware reliability challenge affects a broad computing market, rendering traditional redundancy-based solutions too expensive. Software-anomaly-based hardware error detection has emerged as a low-cost reliability solution, but it suffers from Silent Data Corruptions (SDCs). It is crucial to accurately evaluate SDC rates and identify SDC-producing software locations in order to develop software-centric, low-cost hardware resiliency solutions. A recent tool, called Relyzer, systematically analyzes an entire application's resiliency to single-bit soft errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations. This paper presents a new technique called GangES (Gang Error Simulator) that aims to reduce error simulation time. GangES observes that a set, or gang, of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome; therefore, only one simulation of the gang needs to be completed, resulting in significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For our workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%. This paper also explores techniques based purely on program analysis that could obviate the need for tools such as GangES altogether. The availability of Relyzer+GangES allows us to perform a detailed evaluation of such techniques. We evaluate the accuracy of several previously proposed program metrics. We find that the metrics we considered, and their various linear combinations, are unable to adequately predict an instruction's vulnerability to SDCs, further motivating the use of Relyzer+GangES-style techniques as valuable solutions for the hardware error resiliency evaluation problem.
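
The gang idea described in the abstract can be illustrated with a small toy model. The sketch below is not the GangES implementation and does not reflect how the tool selects comparison points or state. It only assumes a deterministic program split into a hypothetical `prefix` (up to a comparison point) and `suffix` (the rest of the run), with single-bit flips injected into a two-register state. All function names and the toy program are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

State = Tuple[int, int]      # toy machine state: two registers (r0, r1)
Injection = Tuple[int, int]  # (register index, bit position) to flip


def flip_bit(state: State, reg: int, bit: int) -> State:
    """Inject a single-bit error into one register."""
    regs = list(state)
    regs[reg] ^= 1 << bit
    return (regs[0], regs[1])


def prefix(state: State) -> State:
    """Hypothetical code up to the comparison point: keeps only the low 4 bits of r0."""
    r0, r1 = state
    return (r0 & 0x0F, r1)


def suffix(state: State) -> int:
    """Hypothetical remainder of the program, producing the final output."""
    r0, r1 = state
    return r0 * 3 + r1


def gang_simulate(initial: State, injections: List[Injection]):
    """Run each injection to the comparison point; finish only one run per distinct state."""
    golden = suffix(prefix(initial))           # error-free output
    outcome_by_state: Dict[State, str] = {}    # intermediate state -> known outcome
    results: Dict[Injection, str] = {}
    full_runs = 0
    for inj in injections:
        mid = prefix(flip_bit(initial, *inj))
        if mid not in outcome_by_state:
            # First member of this gang: complete the simulation.
            full_runs += 1
            outcome_by_state[mid] = "masked" if suffix(mid) == golden else "SDC"
        # Later members with the same state reuse the recorded outcome.
        results[inj] = outcome_by_state[mid]
    return results, full_runs


if __name__ == "__main__":
    results, full_runs = gang_simulate((5, 7), [(0, 0), (0, 4), (0, 5), (1, 0)])
    print(results, full_runs)
```

In this toy run, the injections into bits 4 and 5 of `r0` reach the same intermediate state, so only one of them is simulated to completion. The reuse is valid here only because the toy program is deterministic and the compared state is the complete state.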