Venue: International Symposium on Microarchitecture

Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content


Abstract

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error-correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates only the failures that occur with the current content in memory while programs are running in the system. Such a mechanism needs to detect failures whenever a write access changes the content of memory. As detecting failures with runtime testing has a high overhead, MEMCON selectively initiates a test on a write, only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle.

Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
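The Pareto observation in the abstract can be made concrete with a small sketch. For a write interval that follows an idealized Pareto distribution with scale `x_min` and shape `alpha > 1`, the expected remaining idle time of a page that has already been idle for `t >= x_min` is `t / (alpha - 1)`, which grows with `t`. The Python below assumes this idealized model. The function names and the `alpha`, `x_min`, and `min_useful_interval` parameters are hypothetical choices for illustration, not MEMCON's actual predictor or settings.

```python
def expected_remaining_idle(t, alpha, x_min):
    """Expected remaining idle time E[X - t | X > t] for X ~ Pareto(x_min, alpha).

    Assumes the idealized Pareto model; alpha must exceed 1 so the mean exists.
    """
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for a finite expectation")
    if x_min <= 0 or t < 0:
        raise ValueError("x_min must be positive and t non-negative")
    if t <= x_min:
        # Every interval is at least x_min, so conditioning on X > t adds nothing:
        # the answer is the unconditional mean minus the time already elapsed.
        return alpha * x_min / (alpha - 1) - t
    # Beyond x_min the remaining time scales linearly with the time already idle.
    return t / (alpha - 1)


def should_start_test(idle_so_far, alpha, x_min, min_useful_interval):
    """Hypothetical decision rule: test the page only if the expected remaining
    idle time is long enough to repay the testing overhead."""
    return expected_remaining_idle(idle_so_far, alpha, x_min) >= min_useful_interval


if __name__ == "__main__":
    # Illustrative values only (arbitrary time units).
    for idle in (1.0, 2.0, 4.0, 8.0):
        remaining = expected_remaining_idle(idle, alpha=2.0, x_min=1.0)
        decision = should_start_test(idle, 2.0, 1.0, min_useful_interval=3.0)
        print(idle, remaining, decision)
```

Under these assumptions, a predictor can defer the test until a page has been idle for some time: a page that has survived a longer idle period is the one most likely to stay unwritten long enough for the reduced refresh rate to pay off.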
