Conference: 2018 18th International Conference on Computational Science and Applications

Exploiting the behavior of the failed job in high performance computing system



Abstract

As demand for high-performance computing power grows, operation management techniques such as checkpointing, failure-aware task scheduling, and system simulation are becoming more important for keeping systems stable. Maintaining a stable system requires a detailed analysis of failed tasks, so this paper analyzes the characteristics of failed jobs in a high-performance computing system. It makes three contributions. First, it presents detailed analysis results for failed jobs, based on the job logs of a supercomputer currently in operation. Second, in addition to overall statistical results, it identifies the distribution of the inter-arrival times of failed job submissions. Third, it uses the hazard rate to analyze the probability that the event occurs.
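
The abstract does not say how the inter-arrival times or the hazard rate were computed, so the following is only a minimal sketch under stated assumptions. It takes the hazard rate in its usual sense, h(t) = f(t) / (1 - F(t)), and approximates it with a simple binned (life-table style) estimate. The function names, the bin width, and the sample timestamps are all hypothetical and do not come from the paper or its job logs.

```python
def inter_arrival_times(submit_times):
    """Return the gaps between consecutive submission timestamps.

    The input may be unsorted; the gaps are in the same unit as the input.
    """
    ordered = sorted(submit_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def empirical_hazard(gaps, bin_width):
    """Binned hazard estimate (an assumed estimator, not the paper's).

    For each bin, divide the number of gaps that end inside the bin by the
    number of gaps that were still "at risk" (not yet ended) at its start.
    """
    if not gaps:
        return []
    n_bins = int(max(gaps) // bin_width) + 1
    counts = [0] * n_bins
    for gap in gaps:
        counts[int(gap // bin_width)] += 1

    hazard = []
    at_risk = len(gaps)
    for count in counts:
        hazard.append(count / at_risk if at_risk else 0.0)
        at_risk -= count
    return hazard


# Hypothetical submission times of failed jobs, in seconds.
failed_job_submit_times = [0, 10, 30, 35, 95]
gaps = inter_arrival_times(failed_job_submit_times)
print(gaps)                                   # [5, 10, 20, 60] after sorting? No: order follows time
print(empirical_hazard(gaps, bin_width=30))
```

With real logs, the same gaps could instead be fitted to candidate distributions, and the shape of the hazard estimate (rising, flat, or falling) would indicate whether another failed submission becomes more or less likely as time since the last one increases.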
