This work concerns appropriate metrics for evaluating microarchitectural enhancements that improve processor lifetime reliability. The most commonly used reliability metric is mean time to failure (MTTF). However, MTTF provides no information on reliability characteristics during the typical operational life of a processor, which is usually much shorter than the MTTF. An alternative that gives more information to both the designer and the user is the time to failure of a small percentage, say n%, of the population, denoted by t_n. Determining t_n, however, requires knowledge of the distribution of processor failure times, which is generally hard to obtain. In this paper, we show (1) how t_n can be obtained and incorporated within previous architecture-level lifetime reliability tools, (2) how t_n relates to MTTF using state-of-the-art reliability models, and (3) the impact of using MTTF instead of t_n on reliability-aware design. We perform our evaluation using RAMP 2.0, a state-of-the-art architecture-level tool for lifetime reliability measurements. Our analysis shows that no clear relationship between t_n and MTTF is apparent across several architectures. Two populations with the same MTTF may have different t_n, and therefore a different number of failures in the same operational period. MTTF fails to capture such behavior and can thus be misleading. Further, when designing reliability-aware systems, using improvements in MTTF as a proxy for improvements in t_n can lead to poor design choices. Depending on the application and the system, MTTF-driven designs may be over-designed (incurring unnecessary cost or performance overhead) or under-designed (failing to meet the required t_n reliability target).
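The claim that two populations can share an MTTF yet differ in t_n can be illustrated with a minimal sketch. It assumes, purely for illustration, that failure times follow a lognormal distribution with shape parameter sigma; the function names, the 30-year MTTF, and the two sigma values are hypothetical and are not taken from RAMP 2.0 or from the paper's results.

```python
import math
from statistics import NormalDist

# Assumption: failure times T are lognormal, i.e. ln(T) ~ Normal(mu, sigma).
# Then MTTF = exp(mu + sigma^2 / 2), so mu follows from a chosen MTTF and sigma.

def lognormal_mu(mttf, sigma):
    """Location parameter mu that yields the requested MTTF for a given sigma."""
    return math.log(mttf) - sigma ** 2 / 2.0

def t_n(mttf, sigma, n_percent):
    """Time by which n% of the population has failed (the n-th percentile)."""
    mu = lognormal_mu(mttf, sigma)
    z = NormalDist().inv_cdf(n_percent / 100.0)  # standard normal quantile
    return math.exp(mu + sigma * z)

def fraction_failed(mttf, sigma, t):
    """Fraction of the population that has failed by time t (lognormal CDF)."""
    mu = lognormal_mu(mttf, sigma)
    return NormalDist().cdf((math.log(t) - mu) / sigma)

if __name__ == "__main__":
    mttf_years = 30.0              # hypothetical MTTF shared by both populations
    for sigma in (0.3, 0.8):       # hypothetical spreads of failure times
        t1 = t_n(mttf_years, sigma, 1.0)
        failed_5y = fraction_failed(mttf_years, sigma, 5.0)
        print(f"sigma={sigma}: t_1 = {t1:.2f} years, "
              f"failed by year 5 = {100 * failed_5y:.4f}%")
```

Under these assumptions, the population with the wider spread reaches its 1% failure point much earlier and sees more failures within a five-year operational period, even though both populations report the same MTTF. A design comparison based on MTTF alone would not distinguish them.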