Conference: International Conference on Computational Science and Applications

Exploiting the behavior of the failed job in high performance computing system

Abstract

As demand for high-performance computing power increases, operation management techniques such as checkpointing, failure-aware task scheduling, and system simulation are becoming more important for stable system operation. Maintaining and managing a stable system requires a detailed analysis of failed tasks. To that end, this paper analyzes the characteristics of failed jobs in a high-performance computing system. It makes three contributions. First, it presents a detailed analysis of failed jobs based on the job logs of a supercomputer currently in operation. Second, beyond an overall statistical analysis, it identifies the distribution of the inter-arrival times of failed job submissions. Third, it uses the hazard rate to analyze the occurrence probability of the event.
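The abstract does not say how the inter-arrival times or the hazard rate were computed, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes the job log has already been reduced to a list of submission timestamps (in seconds) for failed jobs. The function names `inter_arrival_times` and `empirical_hazard`, the fixed bin width, and the sample timestamps are all hypothetical. The hazard estimate here is the discrete one: the fraction of gaps that end within a bin among those that lasted at least until the start of that bin.

```python
from bisect import bisect_left


def inter_arrival_times(submit_times):
    """Return the gaps between consecutive submission timestamps."""
    ordered = sorted(submit_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def empirical_hazard(gaps, bin_width):
    """Discrete hazard estimate per bin of width bin_width.

    For bin [lo, hi): (number of gaps ending in the bin) divided by
    (number of gaps that are at least lo, i.e. still "at risk").
    """
    if not gaps:
        return []
    ordered = sorted(gaps)
    total = len(ordered)
    n_bins = int(ordered[-1] // bin_width) + 1
    rates = []
    for k in range(n_bins):
        lo, hi = k * bin_width, (k + 1) * bin_width
        first_at_risk = bisect_left(ordered, lo)  # index of first gap >= lo
        at_risk = total - first_at_risk
        ended = bisect_left(ordered, hi) - first_at_risk  # gaps in [lo, hi)
        rates.append(ended / at_risk if at_risk else 0.0)
    return rates


# Hypothetical failed-job submission times in seconds (illustrative only).
failed_submit_times = [0, 10, 30, 35, 100]
gaps = inter_arrival_times(failed_submit_times)
hazard = empirical_hazard(gaps, bin_width=20)
```

A decreasing sequence of per-bin rates would suggest that failed submissions cluster in time, while a roughly flat one would be consistent with exponentially distributed gaps; which pattern the paper reports is not stated in the abstract.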
