Conference: Annual IEEE/IFIP International Conference on Dependable Systems and Networks

Characterizing and Understanding HPC Job Failures Over The 2K-Day Life of IBM BlueGene/Q System

Abstract

An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to the large-scale system maintenance and improvement of the service quality for users. In this paper, we investigate the features of hundreds of thousands of jobs in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior (such as bugs in code, wrong configuration, or misoperations). The job failures are correlated with multiple metrics and attributes, such as users/projects and job execution structure (number of tasks, scale, and core-hours). The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the types of errors (i.e., exit codes). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.
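The abstract does not spell out how the similarity-based event filtering works, so the following is only a rough sketch of one plausible reading: events that repeat on the same location and category within a short time window are collapsed into a single interruption, and the mean time to interruption (MTTI) is the observation span divided by the number of interruptions that remain. The `Event` fields, the `window_days` threshold, and the sample data are all assumptions made for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time_days: float  # time since the start of observation, in days
    location: str     # hypothetical hardware location label
    category: str     # hypothetical event category


def filter_similar(events, window_days=0.05):
    """Drop an event if the previous event with the same
    (location, category) occurred at most window_days before it."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.time_days):
        key = (ev.location, ev.category)
        prev = last_seen.get(key)
        if prev is None or ev.time_days - prev > window_days:
            kept.append(ev)
        # Update on every event so a long burst collapses to one entry.
        last_seen[key] = ev.time_days
    return kept


def mean_time_to_interruption(events, observation_days):
    """Observation span divided by the number of distinct interruptions."""
    if not events:
        raise ValueError("no interruptions observed")
    return observation_days / len(events)


# Synthetic data, not taken from the paper's logs.
sample = [
    Event(1.00, "R00-M0", "ddr"),
    Event(1.01, "R00-M0", "ddr"),   # repeat within the window
    Event(1.02, "R00-M0", "ddr"),   # repeat within the window
    Event(4.00, "R01-M1", "link"),
    Event(8.00, "R00-M0", "ddr"),   # same key, but far apart in time
]
distinct = filter_similar(sample)
mtti = mean_time_to_interruption(distinct, observation_days=10.5)
```

Here the burst of three events at day 1 counts as one interruption, so three interruptions remain over 10.5 days. The choice of window and similarity key strongly affects the result, which is why any reported MTTI should be read together with the filtering rule that produced it.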