Reliable ultra-low-voltage cache design for many-core systems.



Abstract

We classify cache errors as hard or soft. Hard errors may be caused by manufacturing defects, threshold or supply voltage variations, or device aging; soft errors are introduced by external particle strikes or other random noise. Traditionally, most soft errors manifest as single-event upsets. However, as we move into the nanometer era, the probability of multi-bit upsets increases significantly because a single particle strike can upset more cache cells. To address both single-bit and multi-bit upsets, we propose two-layer error control codes, which efficiently combine the error detection capability of a rectangular code with the error correction capability of a Hamming product code, to significantly improve system reliability while maintaining low area, power, and latency overhead.

To reduce the supply voltage below the normally acceptable VDDMIN while maintaining appropriate yield and reliability, we exploit existing double-error correcting triple-error detecting (DECTED) codes together with cache line disabling to handle both hard and soft errors efficiently. The proposed method uses a DECTED code for each cache line: one bit of error correction capability is reserved for hard errors and the other for soft errors. Cache lines that contain multiple faulty cells are disabled. To further improve energy efficiency, we propose an adaptive fault-tolerant cache architecture that provides appropriate error control capability for each cache line based on the number of faulty cells detected. We use single-error correcting double-error detecting (SECDED) codes for each cache line to address soft errors, and extra parity bits are used when there are hard errors. Our experimental results show that the proposed method can further reduce supply voltage and increase cache reliability.

We also propose a two-layer error control code, which efficiently combines the error detection capability of rectangular codes with the error correction capability of Hamming product codes, to increase cache error resilience for many-core systems while maintaining low power, area, and latency overhead. Because rectangular codes have low latency and overhead and Hamming product codes have high error control capability, the two-layer error control codes use a simple rectangular code for each cache line to detect cache errors and load the extra Hamming product code check bits only when an error is detected, thus enabling reliable large-scale cache operation. Analysis and experiments are conducted to evaluate the cache fault-tolerant capability of various existing solutions and of the proposed approach. The results show that, compared to complex four-way 4EC5ED, the proposed approach can increase mean-error-to-failure (METF) and mean-time-to-failure (MTTF) by up to 2.8x, reduce storage overhead by over 57%, and increase instructions per cycle (IPC) by up to 7%; compared to simple eight-way SECDED, it increases METF and MTTF by up to 133x, reduces storage overhead by over 11%, and achieves a similar IPC. The cost of the proposed approach is no more than 4% external memory access overhead. To improve system reliability in the presence of a cache coherence protocol, two approaches are proposed: a pre-write-back policy and uneven error protection. The pre-write-back cache policy can reduce the number of cache lines with "irrecoverable" cache states, and uneven error protection provides appropriate error control mechanisms for each cache line based on its cache state. Our analysis and experimental results show that the proposed uneven error-protection approach with the pre-write-back policy can improve system reliability significantly. (Abstract shortened by ProQuest.)
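
The first, detection-only layer of the two-layer scheme can be illustrated with a minimal sketch of a rectangular (row and column parity) code. This is not the dissertation's implementation: the abstract does not give the code dimensions, bit layout, or parity convention, so the row-major layout, even parity, and every function name below (`rect_parity`, `rect_mismatches`, `needs_second_layer`) are assumptions made for illustration. The Hamming product code layer is not implemented here; the sketch only decides when it would need to be consulted.

```python
from typing import List, Tuple


def rect_parity(data: List[int], rows: int, cols: int) -> Tuple[List[int], List[int]]:
    """Even parity per row and per column, for bits laid out row-major.

    The rows x cols layout is an assumption; the abstract does not specify it.
    """
    if len(data) != rows * cols:
        raise ValueError("data length must equal rows * cols")
    row_par = [sum(data[r * cols:(r + 1) * cols]) % 2 for r in range(rows)]
    col_par = [sum(data[c::cols]) % 2 for c in range(cols)]
    return row_par, col_par


def rect_mismatches(data: List[int], rows: int, cols: int,
                    row_par: List[int], col_par: List[int]) -> Tuple[List[int], List[int]]:
    """Return the row indices and column indices whose stored parity no longer matches."""
    new_rows, new_cols = rect_parity(data, rows, cols)
    bad_rows = [r for r in range(rows) if new_rows[r] != row_par[r]]
    bad_cols = [c for c in range(cols) if new_cols[c] != col_par[c]]
    return bad_rows, bad_cols


def needs_second_layer(data: List[int], rows: int, cols: int,
                       row_par: List[int], col_par: List[int]) -> bool:
    """Hypothetical trigger: fetch the stronger check bits only if a mismatch is seen."""
    bad_rows, bad_cols = rect_mismatches(data, rows, cols, row_par, col_par)
    return bool(bad_rows or bad_cols)


if __name__ == "__main__":
    line = [1, 0, 1, 1,
            0, 1, 1, 0]                  # toy 2 x 4 "cache line"
    rp, cp = rect_parity(line, 2, 4)

    upset = line.copy()
    upset[5] ^= 1                        # two adjacent cells flipped in one row,
    upset[6] ^= 1                        # as a multi-bit upset might do
    print(rect_mismatches(upset, 2, 4, rp, cp))
    print(needs_second_layer(line, 2, 4, rp, cp))
```

In this toy case the two flips cancel in the row parity but are still caught by the column parities, which is why a two-dimensional check is attractive for adjacent multi-bit upsets. Some patterns do escape a plain rectangular code, for example four flips at the corners of a rectangle, so the sketch should be read as a detection filter under the stated assumptions and not as a complete protection scheme.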

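Bibliographic record

  • Author

    Zhang, Meilin

  • Author affiliation

    University of Rochester

  • Degree-granting institution: University of Rochester
  • Subject: Electrical engineering
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 234 p.
  • Total pages: 234
  • Original format: PDF
  • Language of text: eng
  • Chinese Library Classification:
  • Keywords:

  • Date added to database: 2022-08-17 11:47:10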
