
Machine Learning Predictions for Underestimation of Job Runtime on HPC System


Abstract

In modern high-performance computing (HPC) systems, users are usually asked to estimate a job's runtime when they submit it, so that the system can schedule it. In general, underestimating the runtime causes the HPC system to terminate the job before it completes. If users could be notified that a job may not finish before its allocated time expires, they could take action, such as killing the job and resubmitting it after adjusting its parameters, to save time and cost. The productivity of HPC systems could also be vastly improved. In this paper, we propose a data-driven approach (that is, one that actively observes, analyzes, and logs jobs) for predicting underestimation of job runtime on HPC systems. Using data produced by TSUBAME 2.5, a supercomputer deployed at the Tokyo Institute of Technology, we apply machine learning algorithms to recognize patterns that indicate whether underestimation of job runtime occurs. Our experimental results show that our approach achieves 80% precision, 70% recall, and a 74% F1-score for runtime-underestimation prediction on the entire dataset. Finally, we split the entire job dataset into subsets categorized by scientific application name. The best precision, recall, and F1-score among these subsets reach 90%, 95%, and 92%, respectively.
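
The abstract reports precision, recall, and F1-score without spelling out how they are computed. As a minimal sketch, assuming the standard binary-classification definitions with "runtime underestimated" as the positive class, the metrics could be computed as follows. The function name `score` and the toy labels are hypothetical illustrations, not taken from the paper or from TSUBAME 2.5 logs.

```python
from typing import NamedTuple, Sequence


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Scores:
    """Standard precision, recall, and F1 for the positive class.

    True means "the job's runtime was underestimated" (assumed convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # correctly flagged jobs
    fp = sum(1 for t, p in pairs if not t and p)    # flagged, but finished in time
    fn = sum(1 for t, p in pairs if t and not p)    # missed underestimations

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


# Made-up toy labels for five jobs (illustration only).
actual = [True, True, True, False, False]
predicted = [True, True, False, True, False]
result = score(actual, predicted)
print(f"precision={result.precision:.2f} "
      f"recall={result.recall:.2f} f1={result.f1:.2f}")
```

Under these definitions, precision is the fraction of flagged jobs that really were underestimated, and recall is the fraction of underestimated jobs that were flagged; which of the two matters more depends on the cost of a false warning versus a missed one.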

Bibliographic details

  • Source
    Supercomputing Frontiers | 2018 | pp. 179-198 | 20 pages
  • Conference location: Singapore (SG)
  • Author affiliations

    Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

    Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

    Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

    Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

    Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan; Real World Big-Data Computing Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan;

  • Original format: PDF
  • Language: English
  • Keywords

    HPC; Job log analysis; Underestimation on job runtime; Machine learning;

