DC-Prophet: Predicting Catastrophic Machine Failures in DataCenters

Abstract

When will a server fail catastrophically in an industrial datacenter? Is it possible to forecast these failures so that preventive actions can be taken to increase the reliability of a datacenter? To answer these questions, we have studied what are probably the largest publicly available datacenter traces, containing more than 104 million events from 12,500 machines. Among these samples, we observe and categorize three types of machine failures, all of which are catastrophic and may lead to information loss or, even worse, to degraded reliability of a datacenter. We further propose a two-stage framework, DC-Prophet (short for DataCenter-Prophet), based on One-Class Support Vector Machine and Random Forest. DC-Prophet extracts surprising patterns and accurately predicts the next failure of a machine. Experimental results show that DC-Prophet achieves an AUC of 0.93 in predicting the next machine failure and an F_3-score of 0.88. The ideal value of the F_3-score is 1, indicating perfect predictions; the intuition behind the F_3-score is to value recall about three times more than precision [12]. On average, DC-Prophet outperforms other classical machine learning methods by 39.45% in F_3-score.
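
To make the F_3-score concrete, here is a minimal sketch of the standard F_beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), evaluated at beta = 3. The function name `f_beta` and the precision/recall values below are hypothetical illustrations and are not taken from the paper's experiments.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F_beta score; beta > 1 weights recall more than precision."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    denom = beta ** 2 * precision + recall
    if denom == 0.0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1.0 + beta ** 2) * precision * recall / denom


# Hypothetical values, chosen only to show the asymmetry at beta = 3.
high_recall = f_beta(precision=0.5, recall=0.9)     # recall-heavy predictor
high_precision = f_beta(precision=0.9, recall=0.5)  # precision-heavy predictor
perfect = f_beta(precision=1.0, recall=1.0)         # ideal predictor

print(f"high recall:    {high_recall:.4f}")
print(f"high precision: {high_precision:.4f}")
print(f"perfect:        {perfect:.4f}")
```

With the same two numbers swapped, the recall-heavy predictor scores noticeably higher than the precision-heavy one. That asymmetry is the property that suits failure prediction, where a missed failure typically costs more than a false alarm.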