
Architecting Memory Systems Upon Highly Scaled Error-Prone Memory Technologies



Abstract

DRAM (dynamic random access memory) technology has fueled the computing industry for almost five decades and plays an essential role in enabling modern information technology infrastructure. However, as DRAM technology scales to 20 nm and below, it has become increasingly challenging to maintain the historical reduction in cost per bit. In particular, as scaling moves below 20 nm, it becomes more and more difficult to achieve sufficient DRAM data retention time. DRAM cells with relatively short retention times are referred to as weak cells and may fail to hold stored data for a full refresh period (e.g., 64 ms or 128 ms in current practice). Thus, tremendous effort has been devoted to seeking alternative memory technologies. Several emerging memory technologies have been considered promising candidates, for example Spin-Transfer-Torque (STT) RAM and Phase Change Memory (PCM). Although these emerging memory technologies may have advantages in scaling, they inevitably face cost, capacity and reliability challenges. In conventional practice, all erroneous memory cells are masked by redundancy repair and error control codes (ECC), and are therefore invisible to the outside. However, it becomes impractical for the memory industry to keep this design philosophy in the sub-20 nm regime.

This thesis presents a series of orthogonal memory system design techniques that leverage the characteristics of various applications to optimize memory fault tolerance in highly scaled memory technologies. It advocates system-aided memory scaling and a data-dependent error-tolerance design strategy that allow memory chips to deliver erroneous bits. These erroneous bits are directly visible to, and tolerated by, the system-level memory controller instead of the memory chips themselves. This design is evaluated for the case of DRAM and STT-RAM used in solid-state drives (SSDs). By dynamically and jointly adjusting ECC configurations, the memory controller is able to adapt to runtime data access characteristics. This technique yields significant savings in ECC redundancy and improves data reliability.

3D memory chip stacking is also a promising technology, an entirely new category of high-performance memory that delivers unprecedented system performance and bandwidth. Although emerging 3D DRAM products can significantly improve computing system performance, their relatively high cost is one of the most critical issues preventing wide real-life adoption. Fortunately, system-aided DRAM scaling fits very naturally with emerging 3D integrated DRAM-controller chips such as the hybrid memory cube (HMC). Under such a system-aided DRAM scaling design framework, the most crucial challenge is how to compensate most effectively for the memory errors caused by erroneous cells at minimal overhead in data access latency and redundancy. Conventional ECC designs for memory focus on random errors and pay no attention to the error patterns introduced by weak cells. The design strategy proposed in this thesis can tolerate weak cell rates as high as 10^-4 and 6 x 10^-5 when 100% and 90% of all weak cells, respectively, are known in advance. Using Micron's HMC 3D DRAM chips as the test vehicle, the evaluated implementation occupies less than 0.4 mm^2 (45 nm node) on the logic die. Simulations with CPU and DRAM simulators over a variety of computing benchmarks further show that this design solution incurs less than 2% performance degradation on average.

Besides hardware-based strategies, this thesis also presents a software-based solution for using DRAM with unrepaired weak cells in computing systems. The solution is based on the simple idea that the operating system (OS) reserves all error-prone pages, that is, pages containing at least one unrepaired weak cell, so that they are not used. Under a relatively high error-prone page rate (e.g., 8%), it is almost impossible for the OS to allocate a contiguous, fragmentation-free physical memory space for some critical operations. Moreover, withholding all error-prone pages from use could cause noticeable waste of memory resources. To address these issues, this thesis presents a controller-based selective page remapping strategy that ensures a contiguous critical memory region for the OS, and develops a software-based memory error tolerance scheme that recycles all error-prone pages for the zRAM function in Linux. Experiments are carried out using SPEC CPU2006, and the latency, hardware cost and effectiveness of recycling error-prone pages for zRAM in Linux are studied further. The experimental results show that the proposed software-based error tolerance scheme degrades the speed of zRAM by at most 7%.
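
As an illustration of the page-reservation idea in the abstract, the following is a minimal sketch rather than the thesis's implementation. It assumes 4 KiB pages and statistically independent weak cells, neither of which is stated in the abstract, and the function names are hypothetical. It shows how a per-cell weak rate translates into an error-prone page rate, and how a list of known weak-cell byte addresses maps to the set of pages an OS would withhold.

```python
import math

def error_prone_page_rate(weak_cell_rate: float, page_bytes: int = 4096) -> float:
    """Probability that a page holds at least one weak cell.

    Assumption (not from the thesis): weak cells occur independently,
    so P = 1 - (1 - p)^n for n bits per page.
    """
    bits = page_bytes * 8
    # Numerically stable form of 1 - (1 - p)^n for very small p.
    return -math.expm1(bits * math.log1p(-weak_cell_rate))

def error_prone_pages(weak_cell_addrs, page_bytes: int = 4096) -> set:
    """Page frame numbers containing at least one known weak cell.

    weak_cell_addrs: iterable of byte addresses of weak cells (hypothetical input format).
    """
    return {addr // page_bytes for addr in weak_cell_addrs}

if __name__ == "__main__":
    # Illustrative rate only; chosen to show how quickly pages become error-prone.
    print(f"{error_prone_page_rate(2.5e-6):.3f}")
    print(sorted(error_prone_pages([0, 100, 4096, 9000])))
```

Under these assumptions, even a weak-cell rate of a few parts per million leaves a sizeable share of pages error-prone, which is why simply withholding such pages fragments physical memory and wastes capacity.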

Bibliographic details

  • Author

    Wang, Hao

  • Author affiliation

    Rensselaer Polytechnic Institute

  • Degree-granting institution Rensselaer Polytechnic Institute
  • Subject Computer engineering
  • Degree Ph.D.
  • Year 2017
  • Pages 108
  • Original format PDF
  • Language English
