Introducing a Reliability Analysis Framework for High Performance Computing Environments

Abstract

Supercomputing environments are becoming the norm for daily use. However, their complex infrastructure, which contains thousands of nodes representing various applications and processors, makes troubleshooting and monitoring failures extremely difficult. To address these concerns, we propose a real-time reliability analysis framework for high performance computing (HPC) environments whose contributions are three-fold. First, an improved data network extrapolation (DNE) methodology is proposed as a pre-processing module. This component incorporates the system failure information (i.e., job, fault, and error log files) and performs robust job-based failure accounting for sequential and parallel jobs. It also performs cross-referencing to compute task-based failure accounting, under the assumption that each task comprises one or more jobs. Next, a reliability characterization and analysis (RCA) schema is proposed that takes the failure information from the DNE process and performs survival analyses on each individual node as well as on the entire reliability infrastructure. This is coupled with a failure metrics characterization (FMC) schema that estimates failure metrics such as the mean time to failure (MTTF) and the hazard rate. Additionally, a comparative analysis is made between the Log-Normal and Weibull distributions in terms of modeling job- and task-based failure activity. Empirical analysis using the Structural Simulation Toolkit (SST) illustrates the promise of this approach for characterizing, monitoring, and troubleshooting failure behavior. The results of this work can give systems administrators dynamic tools to pinpoint and monitor failure behavior, its impacts, and alternative job-scheduling policies without interrupting production processes.
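The abstract does not say how the failure metrics are estimated, so the following is only a minimal sketch of the kind of computation the FMC step and the Log-Normal versus Weibull comparison describe. It assumes uncensored times-to-failure, maximum-likelihood fitting, and synthetic data. All function names (`fit_weibull`, `fit_lognormal`, `weibull_mttf`, `weibull_hazard`) are hypothetical and are not taken from the paper or from SST.

```python
import math
import random


def fit_lognormal(times):
    """Closed-form MLE for a log-normal: mean and std of log(times)."""
    logs = [math.log(t) for t in times]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def fit_weibull(times, lo=0.01, hi=50.0, iters=200):
    """MLE for a Weibull (shape k, scale) via bisection on k.

    Assumes the true shape lies in [lo, hi]; that range is an arbitrary choice.
    """
    n = len(times)
    t_max = max(times)
    u = [t / t_max for t in times]  # rescale so u**k cannot overflow
    mean_log_u = sum(math.log(x) for x in u) / n

    def score(k):
        # Likelihood equation for the shape; it is increasing in k.
        s0 = sum(x ** k for x in u)
        s1 = sum(x ** k * math.log(x) for x in u)
        return s1 / s0 - 1.0 / k - mean_log_u

    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = (lo + hi) / 2
    scale = t_max * (sum(x ** k for x in u) / n) ** (1.0 / k)
    return k, scale


def weibull_loglik(times, k, scale):
    return sum(math.log(k / scale) + (k - 1) * math.log(t / scale)
               - (t / scale) ** k for t in times)


def lognormal_loglik(times, mu, sigma):
    c = math.log(sigma) + 0.5 * math.log(2 * math.pi)
    return sum(-math.log(t) - c - (math.log(t) - mu) ** 2 / (2 * sigma ** 2)
               for t in times)


def weibull_mttf(k, scale):
    """Mean time to failure of a Weibull: scale * Gamma(1 + 1/k)."""
    return scale * math.gamma(1.0 + 1.0 / k)


def weibull_hazard(t, k, scale):
    """Hazard rate h(t) = (k / scale) * (t / scale)**(k - 1)."""
    return (k / scale) * (t / scale) ** (k - 1)


# Synthetic times-to-failure in hours (illustrative only, not data from the paper).
rng = random.Random(0)
times = [rng.weibullvariate(100.0, 1.5) for _ in range(2000)]

k_hat, scale_hat = fit_weibull(times)
mu_hat, sigma_hat = fit_lognormal(times)
ll_weibull = weibull_loglik(times, k_hat, scale_hat)
ll_lognormal = lognormal_loglik(times, mu_hat, sigma_hat)

print("Weibull shape, scale:", k_hat, scale_hat)
print("Weibull MTTF (hours):", weibull_mttf(k_hat, scale_hat))
print("Hazard at t=50h:", weibull_hazard(50.0, k_hat, scale_hat))
print("Log-likelihood (Weibull, Log-Normal):", ll_weibull, ll_lognormal)
```

Both candidate models have two parameters, so comparing their maximized log-likelihoods is equivalent to comparing AIC values. Real HPC failure logs typically contain censored observations (jobs that finish without failing), which this sketch deliberately ignores.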
