Conference: European Dependable Computing Conference

Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data

Abstract

Uncorrectable memory errors are the leading cause of server failures in datacenters. Predicting uncorrectable errors (UEs) from historical correctable error (CE) information enables proactive replacement of memory hardware before these catastrophic events happen. In this paper, we perform an empirical study of UE prediction on large-scale field data from more than 30,000 contemporary servers in Tencent datacenters over an 8-month period. We demonstrate that the traditional approach based on CE rate performs poorly, with low precision. We then leverage detailed micro-level CE information to design several new predictors. The comparative study shows that the new predictor based on column fault identification boosts the baseline precision by more than 300% and, at the same time, substantially improves the baseline recall.
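
As a rough illustration of the two kinds of predictor contrasted in the abstract, the Python sketch below flags DIMMs either by raw CE count or by a simple column-fault heuristic, then scores each with precision and recall. The abstract does not give the paper's definitions, so everything specific here is an assumption made for illustration: the record fields (`dimm`, `bank`, `row`, `column`), the function names, the thresholds, and the rule "CEs on the same column across several distinct rows". It is not the authors' method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CE:
    """One correctable-error record (hypothetical schema)."""
    dimm: str
    bank: int
    row: int
    column: int


def predict_by_ce_rate(events, threshold):
    """Flag a DIMM once its total CE count reaches `threshold` (assumed rule)."""
    counts = defaultdict(int)
    for e in events:
        counts[e.dimm] += 1
    return {dimm for dimm, n in counts.items() if n >= threshold}


def predict_by_column_fault(events, min_rows):
    """Flag a DIMM when one (bank, column) pair shows CEs on at least
    `min_rows` distinct rows (assumed definition of a column fault)."""
    rows_seen = defaultdict(set)
    for e in events:
        rows_seen[(e.dimm, e.bank, e.column)].add(e.row)
    return {key[0] for key, rows in rows_seen.items() if len(rows) >= min_rows}


def precision_recall(predicted, actual):
    """Precision and recall of a predicted DIMM set against DIMMs that had a UE."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

On real logs, `actual` would be the set of DIMMs that later reported a UE within the chosen prediction window. Both predictors can then be compared on the same event stream by passing their outputs to `precision_recall`.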