Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System

Abstract

An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to the large-scale system maintenance and improvement of the service quality for users. In this paper, we investigate the features of hundreds of thousands of jobs in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior (such as bugs in code, wrong configuration, or misoperations). The job failures are correlated with multiple metrics and attributes, such as users/projects and job execution structure (number of tasks, scale, and core-hours). The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the types of errors (i.e., exit codes). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.
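The abstract does not say how the similarity-based event filtering or the mean time to interruption (MTTI) is computed, so the following is only a minimal sketch of the general idea under simple assumptions. Two events are treated as "similar" when they share a location and a message category and occur within a fixed time window, and MTTI is taken as the observation span divided by the number of events that survive filtering. The `Event` fields, the window rule, and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: float     # seconds since the start of observation (assumed unit)
    location: str   # hypothetical component identifier
    message: str    # hypothetical event category


def filter_similar(events, window):
    """Drop an event if an event with the same (location, message) key
    occurred at most `window` seconds before it. Dropped duplicates also
    refresh the timestamp, so a long burst collapses to its first event."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.location, ev.message)
        prev = last_seen.get(key)
        if prev is None or ev.time - prev > window:
            kept.append(ev)
        last_seen[key] = ev.time
    return kept


def mean_time_to_interruption(events, observation_span):
    """Observation span divided by the number of (filtered) interruptions."""
    if not events:
        raise ValueError("no interruption events")
    return observation_span / len(events)
```

With this rule, a burst of repeated reports from one component counts as a single interruption, which is why filtering raises the estimated MTTI compared with counting every raw event.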

Bibliographic record

Source
Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019, pp. 473-484 (12 pages)
Authors
Sheng Di; Hanqi Guo; Eric Pershey; Marc Snir; Franck Cappello
Original format
PDF
Keywords
Supercomputers; Task analysis; Correlation; Reliability; Cobalt; Large-scale systems; Resilience

Similar literature

1. Topology-aware Job Allocation in 3D Torus-based HPC Systems with Hard Job Priority Constraints [J]. Kangkang Li, Maciej Malawski, Jarek Nabrzyski. Procedia Computer Science, 2017, no. 1
2. Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems [J]. Agelastos Anthony, Allan Benjamin, Brandt Jim. Parallel Computing, 2016, no. octa
3. A Value-Oriented Job Scheduling Approach for Power-Constrained and Oversubscribed HPC Systems [J]. Nirmal Kumbhare, Aniruddha Marathe, Ali Akoglu. IEEE Transactions on Parallel and Distributed Systems, 2020, no. 6
4. Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System [C]. Sheng Di, Hanqi Guo, Eric Pershey. Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2019
5. Modeling and simulation of HPC systems through job scheduling analysis [D]. Hurst, William B. 2011
6. Understanding Search Failures in Consumer Health Information Systems [O]. Alexa T. McCray, Tony Tse. 2003
7. 3D Neutron Transport and HPC: A PWR Full Core Calculation Using PENTRAN SN Code and IBM BLUEGENE/P Computers [O]. Tanguy COURAU, Glenn SJODEN. 2011
