Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System

Di Sheng; Guo Hanqi; Gupta Rinku; Pershey Eric R.; Snir Marc; Cappello Franck



Abstract

In this paper, we explore potential correlations among fatal system events on one of the most powerful supercomputers, the IBM Blue Gene/Q Mira deployed at Argonne National Laboratory, based on its 5-year reliability, availability, and serviceability (RAS) log. Our contribution is twofold. (1) We design an efficient log analysis tool, LogAider, with a novel filtering method that extracts fatal events from the masses of heavily duplicated system messages in the log. LogAider detects temporal correlation precisely, with high similarity (up to 95 percent) to the ground truth, i.e., the failure records reported by the administrators. Filtering reduces the original 2.6 million duplicated fatal messages to about 1,255 fatal events. (2) We analyze the 5-year RAS log of the Mira system using LogAider and summarize six important "takeaways" that can help system vendors and administrators better understand an extreme-scale system's fatal events. Specifically, we find that the distribution or proportion of fatal system events generally follows a Pareto-like principle. The temporal correlation among fatal events is much stronger than that among warn and info messages, and correlated events tend to form a few clusters. The mean time between fatal events (MTBFE) of the Mira system is about 1.3 days from the perspective of the system, and the mean time to interruption (MTTI) is 2-4 days from the perspective of users. The most error-prone value of any key attribute tends to appear in the log every 2-10 days. Weibull, Gamma, and Pearson6 are the three best-fit distributions for the fatal event intervals. The overall correlation of fatal events on the 5D torus network is not prominent, whereas the small-region locality correlation (e.g., among fatal events inside a rack) is relatively strong. We believe our work will be of interest to large-scale HPC system administrators and vendors and to fault tolerance researchers, enabling them to better understand fatal events and mitigate them accordingly.
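To make the two quantities in the abstract concrete, the following minimal Python sketch collapses duplicated fatal messages into events and then computes a mean time between events. It is not LogAider's filtering method, which the abstract does not detail. It assumes a simple rule: a message belongs to an already-open event if it carries the same message ID and arrives within a fixed time window of the previous message with that ID. The function names, the one-hour window, the message IDs, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def collapse_duplicates(messages, window):
    """Collapse duplicated messages into events.

    `messages` is an iterable of (timestamp, message_id) pairs.
    Assumed rule (not LogAider's): a message opens a new event only if no
    message with the same ID was seen within `window` before it.
    """
    events = []
    last_seen = {}  # message_id -> timestamp of the most recent message
    for ts, msg_id in sorted(messages):
        prev = last_seen.get(msg_id)
        if prev is None or ts - prev > window:
            events.append((ts, msg_id))
        last_seen[msg_id] = ts
    return events


def mean_time_between(events):
    """Mean gap between consecutive event timestamps, or None if < 2 events."""
    times = sorted(ts for ts, _ in events)
    if len(times) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Synthetic sample data, invented for illustration only.
sample = [
    (datetime(2015, 1, 1, 0, 0), "MSG_A"),
    (datetime(2015, 1, 1, 0, 5), "MSG_A"),   # duplicate of the first event
    (datetime(2015, 1, 1, 0, 9), "MSG_A"),   # duplicate of the first event
    (datetime(2015, 1, 2, 12, 0), "MSG_B"),
    (datetime(2015, 1, 2, 12, 1), "MSG_B"),  # duplicate of the second event
    (datetime(2015, 1, 4, 0, 0), "MSG_A"),   # far outside the window: new event
]

events = collapse_duplicates(sample, window=timedelta(hours=1))
mtbfe = mean_time_between(events)
print(len(events), mtbfe)
```

On the synthetic sample, six messages collapse into three events, and the two 36-hour gaps give a mean of 1.5 days. A real analysis would also need to account for spatial attributes such as location, which this sketch ignores.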


Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems, 2019, Issue 2, pp. 361-374 (14 pages)
Authors
Di Sheng; Guo Hanqi; Gupta Rinku; Pershey Eric R.; Snir Marc; Cappello Franck
Author affiliations

Argonne Natl Lab, MCS, Argonne, IL 60439 USA;

Argonne Natl Lab, MCS, Argonne, IL 60439 USA;

Argonne Natl Lab, MCS, Argonne, IL 60439 USA;

Argonne Natl Lab, MCS, Argonne, IL 60439 USA;

Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA;

Argonne Natl Lab, MCS, Argonne, IL 60439 USA;

Original format: PDF
Language: English
Keywords
Peta-scale supercomputer; mining correlations; fatal event analysis; reliability-availability-serviceability (RAS)


