
System-Level Electromigration-Induced Dynamic Reliability Management


Abstract

Technology scaling has enabled ever-higher levels of processor integration, and future manycore chips will integrate still more cores. However, with the breakdown of Dennard scaling, chip power density is increasing at current and future technology nodes. As a result, only a fraction of a manycore processor can be powered on at any time because of power and temperature limits, a trend that has led to so-called dark silicon manycore processors. In addition, reliability is becoming a limiting constraint in high-performance nanometer VLSI chip design because of the high failure rates of deep-submicron and nanoscale devices. Future chips are expected to show signs of reliability-induced aging much sooner than previous generations. Among the many reliability effects, electromigration (EM) has become a major design constraint because of aggressive transistor and interconnect scaling and increasing power density.

This thesis develops new system-level EM-induced dynamic reliability management techniques for several kinds of systems. First, I develop system-level management for real-time embedded systems: I investigate a new lifetime optimization technique for real-time embedded processors that accounts for EM-induced reliability. The approach is based on a recently proposed physics-based EM model for more accurate EM assessment of a power grid network at the chip level. Second, I develop new energy and lifetime optimization techniques for emerging dark silicon manycore microprocessors, considering both long-term reliability effects (hard errors) and transient soft errors. To optimize EM-induced lifetime, I apply an adaptive Q-learning-based method, which suits dynamic runtime operation because it provides good solutions at low cost.
Third, I develop new dynamic reliability management (DRM) techniques at the system level for emerging low-power dark silicon manycore microprocessors operating in the near-threshold region, focusing mainly on EM recovery effects. To exploit these recovery effects, which were ignored in the past, at the system level, I develop a new equivalent DC current model that accounts for recovery under general time-varying current waveforms so that existing compact EM models can be applied. Fourth, I develop a new approach for cross-layer EM-induced reliability modeling and optimization at the physics, system, and data center levels. To speed up online energy optimization in a data center, I investigate a new combined compact model of data center power and reliability using a learning-based approach, in which a feed-forward neural network (FNN) is trained to predict the energy and long-term reliability of each processor under data center scheduling and workloads. Lastly, I develop long-term reliability management for GPU architectures using spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. I find that the existing reliability-agnostic thread block scheduler for spatial multitasking achieves high GPU utilization but poor reliability. I develop and implement a long-term reliability-aware thread block scheduler in GPGPU-sim and compare it against the existing reliability-agnostic scheduler.
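
The abstract names Q-learning as the runtime optimizer but gives no details of its states, actions, or reward. As a rough illustration only, the sketch below shows plain tabular Q-learning with an epsilon-greedy policy, which may differ from the adaptive variant used in the thesis. The environment `toy_step`, its three stress levels, its two frequency actions, and every numeric constant are hypothetical stand-ins and are not taken from the thesis or from any physics-based EM model.

```python
import random
from typing import Callable, List, Tuple

QTable = List[List[float]]


def q_update(q: QTable, s: int, a: int, reward: float, s_next: int,
             alpha: float, gamma: float) -> None:
    """Apply one standard tabular Q-learning update in place."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def choose_action(q: QTable, s: int, epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[s]))
    return max(range(len(q[s])), key=lambda a: q[s][a])


def toy_step(s: int, a: int) -> Tuple[int, float]:
    """Hypothetical environment, NOT the thesis's EM model.

    State s in {0, 1, 2} is a coarse EM-stress level.
    Action 0 = low frequency (stress relaxes), 1 = high frequency (stress grows).
    """
    if a == 1:
        s_next = min(s + 1, 2)
        # High frequency earns a performance reward of 1.0, minus an
        # assumed penalty of 5.0 when stress is already at its maximum.
        reward = 1.0 - (5.0 if s == 2 else 0.0)
    else:
        s_next = max(s - 1, 0)
        reward = 0.2  # assumed small reward for the low-frequency mode
    return s_next, reward


def train(step: Callable[[int, int], Tuple[int, float]], n_states: int = 3,
          n_actions: int = 2, steps: int = 5000, alpha: float = 0.1,
          gamma: float = 0.9, epsilon: float = 0.1, seed: int = 0) -> QTable:
    """Run Q-learning online for a fixed number of control steps."""
    rng = random.Random(seed)
    q: QTable = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        a = choose_action(q, s, epsilon, rng)
        s_next, reward = step(s, a)
        q_update(q, s, a, reward, s_next, alpha, gamma)
        s = s_next
    return q
```

Calling `train(toy_step)` returns a 3-by-2 table; taking the index of the largest entry in each row gives the greedy action for that stress level. A table-based learner is used here because each control step costs one lookup and one update, which is in line with the abstract's point about low-cost runtime operation.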

Bibliographic record

  • Author

    Kim, Taeyoung

  • Author affiliation

    University of California, Riverside

  • Degree-granting institution University of California, Riverside
  • Subject Computer science
  • Degree Ph.D.
  • Year 2017
  • Pages 187 p.
  • Total pages 187
  • Original format PDF
  • Language English