
Doomsday: Predicting Which Node Will Fail When on Supercomputers

Abstract

Predicting which node will fail, and how soon, remains a challenge for HPC resilience, yet it may pave the way for proactive remedies applied before jobs fail. Distilling anomalous events from noisy raw logs requires substantial effort, not only as systems scale toward exascale but even on contemporary supercomputer architectures. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pinpoint node failures, which is unprecedented. Our study, based on real system data and statistical machine learning, demonstrates the feasibility of predicting which specific node will fail in Cray systems. TBP achieves recall rates of no less than 83% with lead times as high as 2 minutes. This opens the door to enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient use of both computing capacity and power in large-scale production systems.
