Journal: IEEE Transactions on Parallel and Distributed Systems

Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System


Abstract

In this paper, we explore potential correlations of fatal system events for one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is two-fold. (1) We design an efficient log analysis tool, LogAider, with a novel filtering method that extracts fatal events from the masses of heavily duplicated system messages in the log. LogAider detects temporal correlation very precisely, with a high similarity (up to 95 percent) to the ground truth, i.e., the failure records reported by the administrators. Filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of fatal system events generally follows a Pareto-like principle. The temporal correlation among fatal events is much stronger than that among warn messages and info messages, and the correlated events tend to form a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. For any key attribute, the most error-prone value is likely to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., among fatal events inside racks) is relatively strong.
We believe our work will be of interest to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate them accordingly.
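
To make the relationship between duplicated fatal messages, fatal events, and the MTBFE more concrete, the following is a minimal sketch in Python. It is not LogAider's filtering method, which the abstract does not describe in detail; it assumes a simple time-window rule in which a message starts a new event only if it arrives more than a chosen window after the previous message. The function names, the 5-minute window, and the timestamps are all hypothetical and are not taken from the Mira log.

```python
from datetime import datetime, timedelta


def coalesce_events(timestamps, window):
    """Collapse message timestamps into event start times.

    Assumed rule (not LogAider's): a message starts a new event only if
    it arrives more than `window` after the previous message.
    """
    events = []
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > window:
            events.append(ts)  # first message of a new event
        previous = ts
    return events


def mean_time_between(events):
    """Mean interval between consecutive event start times, or None."""
    if len(events) < 2:
        return None
    return (events[-1] - events[0]) / (len(events) - 1)


# Hypothetical fatal-message timestamps, for illustration only.
messages = [
    datetime(2015, 1, 1, 0, 0, 0),
    datetime(2015, 1, 1, 0, 0, 30),
    datetime(2015, 1, 1, 0, 1, 0),
    datetime(2015, 1, 2, 12, 0, 0),
    datetime(2015, 1, 2, 12, 0, 10),
    datetime(2015, 1, 4, 0, 0, 0),
]

events = coalesce_events(messages, window=timedelta(minutes=5))
mtbfe = mean_time_between(events)
print(len(events), mtbfe)
```

In this toy input, six messages collapse into three events, and the mean interval is computed over event start times rather than over raw messages. The choice of window directly affects the event count, which is why a figure such as the MTBFE only makes sense after duplicates have been filtered.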
