Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content

Abstract

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates only the failures that occur with the current content in memory while programs are running in the system. Such a mechanism needs to detect failures whenever a write access changes the content of memory. As detecting failures with runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval.

MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
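
The Pareto property stated in the abstract can be illustrated with a short sketch. This is not MEMCON's implementation; it only shows why a Pareto-distributed write interval makes "idle so far" a predictor of "idle for longer". The function names and the parameter values (scale x_m = 1.0, shape alpha = 1.5, times in arbitrary units) are assumptions chosen for illustration and do not come from the paper.

```python
def pareto_survival(x, x_m, alpha):
    """P(X > x) for a Pareto(x_m, alpha) write interval X."""
    if x <= x_m:
        return 1.0
    return (x_m / x) ** alpha


def prob_remains_idle(elapsed, extra, x_m, alpha):
    """P(X > elapsed + extra | X > elapsed): chance that a page already
    idle for `elapsed` stays idle for at least `extra` more."""
    return (pareto_survival(elapsed + extra, x_m, alpha)
            / pareto_survival(elapsed, x_m, alpha))


def expected_remaining_idle(elapsed, x_m, alpha):
    """E[X - elapsed | X > elapsed]; finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("mean is infinite for alpha <= 1")
    t = max(elapsed, x_m)
    # For a Pareto variable, E[X | X > t] = alpha * t / (alpha - 1).
    return alpha * t / (alpha - 1) - elapsed


if __name__ == "__main__":
    X_M, ALPHA = 1.0, 1.5  # illustrative values, not measured data
    for elapsed in (2, 10, 100):
        p = prob_remains_idle(elapsed, 10, X_M, ALPHA)
        rem = expected_remaining_idle(elapsed, X_M, ALPHA)
        print(f"idle {elapsed:>3}: P(idle 10 more) = {p:.3f}, "
              f"expected remaining = {rem:.1f}")
```

Under these assumptions, both the probability of staying idle and the expected remaining idle time grow with the time already spent idle, which is the behavior a predictor of long write intervals can exploit: waiting a short while after a write before starting a test filters out most short intervals.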

Bibliographic details

Source: International Symposium on Microarchitecture, 2017, xix, 825 p. (14 pages)
Authors: Samira Khan; Chris Wilkerson; Zhe Wang; Alaa R. Alameldeen; Donghyuk Lee; Onur Mutlu
Original format: PDF
Chinese Library Classification: TP302-532
Keywords: DRAM chips; error correction codes; integrated circuit reliability; multiprocessing systems; Pareto distribution
