Conference: Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops

Predicting Job Completion Times Using System Logs in Supercomputing Clusters



Abstract

Most large systems, such as HPC and cloud computing clusters and data centers, are built from commercial off-the-shelf components. System logs are usually the main source of insight into system issues, so mining logs to diagnose anomalies has been an active research area. Because logs from commodity PC clusters lack organization and semantic consistency, what constitutes a fault or an error is subjective, and building an automatic failure prediction model from log messages is therefore hard. In this paper we sidestep the difficulty by asking a different question: given the concomitant system log messages of a running job, can we predict the job's remaining time? We adopt a Hidden Markov Model (HMM) coupled with frequency analysis to achieve this. Our HMM approach can predict 75% of jobs' remaining times with an error of less than 200 seconds.
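To make the idea of predicting remaining time from log messages more concrete, the following is a minimal sketch and not the authors' model. It assumes a hypothetical three-state HMM (early, middle, late phase of a job), two hypothetical log-message categories, and made-up probabilities and per-state mean remaining times. It runs the standard scaled forward algorithm to estimate which phase the job is in, then averages the per-state remaining times under that estimate. The frequency analysis mentioned in the abstract is not modeled here.

```python
from typing import List, Sequence

# All numbers below are hypothetical, chosen only for illustration.
START = [0.8, 0.15, 0.05]            # P(initial hidden state): early, middle, late
TRANS = [[0.7, 0.3, 0.0],            # P(next state | current state)
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
EMIT = [[0.9, 0.1],                  # P(log category | state); category 0 = routine,
        [0.5, 0.5],                  # category 1 = teardown-like (assumed labels)
        [0.1, 0.9]]
MEAN_REMAINING = [600.0, 300.0, 60.0]  # assumed mean seconds left in each state


def forward_posterior(obs: Sequence[int],
                      start: Sequence[float],
                      trans: Sequence[Sequence[float]],
                      emit: Sequence[Sequence[float]]) -> List[float]:
    """Return P(state at the last step | all observations), using the
    forward algorithm with per-step normalisation.

    Assumes obs is non-empty and has non-zero probability under the model.
    """
    n = len(start)
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = sum(alpha)
    alpha = [a / total for a in alpha]
    # Propagate one step per subsequent observation.
    for o in obs[1:]:
        alpha = [
            emit[j][o] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        total = sum(alpha)
        alpha = [a / total for a in alpha]
    return alpha


def expected_remaining_seconds(posterior: Sequence[float],
                               mean_remaining: Sequence[float]) -> float:
    """Posterior-weighted average of the per-state mean remaining times."""
    return sum(p * r for p, r in zip(posterior, mean_remaining))


if __name__ == "__main__":
    observed = [0, 0, 1, 1]  # hypothetical sequence of log-message categories
    post = forward_posterior(observed, START, TRANS, EMIT)
    print(expected_remaining_seconds(post, MEAN_REMAINING))
```

In a real setting the transition, emission, and remaining-time parameters would have to be learned from historical job logs; here they are fixed by hand so the mechanics stay visible.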
