Machine Learning Predictions for Underestimation of Job Runtime on HPC System

Abstract

In modern high-performance computing (HPC) systems, users are usually asked to estimate the runtime of a job when they submit it, so that the system can schedule it. In general, underestimating the job runtime causes the HPC system to terminate the job before it completes. If users could be notified that their jobs may not finish before the allocated time expires, they could take action, such as killing the job and resubmitting it after adjusting its parameters, to save time and cost. The productivity of HPC systems could also be vastly improved. In this paper, we propose a data-driven approach (that is, one that actively observes, analyzes, and logs jobs) for predicting underestimation of job runtime on HPC systems. Using data produced by TSUBAME 2.5, a supercomputer deployed at the Tokyo Institute of Technology, we apply machine learning algorithms to recognize patterns that indicate whether underestimation of job runtime occurs. Our experimental results show that our approach to runtime-underestimation prediction achieves 80% precision, 70% recall, and a 74% F1-score on the entirety of a given dataset. Finally, we split the entire job dataset into subsets categorized by scientific application name. Across these subsets, the best precision, recall, and F1-score for runtime-underestimation prediction reached 90%, 95%, and 92%, respectively.
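
The three metrics quoted in the abstract can be illustrated with a minimal Python sketch. It treats "runtime was underestimated" as the positive class, which is an assumption about how the labels are defined. The function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the paper or from TSUBAME 2.5 logs; the counts were chosen only so that precision and recall come out at 0.80 and 0.70.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive (True) class.

    Here True is assumed to mean "the job's runtime was underestimated".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # predicted and actual
    fp = sum(1 for t, p in pairs if not t and p)    # predicted, not actual
    fn = sum(1 for t, p in pairs if t and not p)    # actual, not predicted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy labels: 56 true positives, 14 false positives,
# 24 false negatives and 6 true negatives (invented for illustration).
y_true = [True] * 56 + [False] * 14 + [True] * 24 + [False] * 6
y_pred = [True] * 56 + [True] * 14 + [False] * 24 + [False] * 6

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, 0.80 and 0.70 give 2 × 0.80 × 0.70 / 1.50 ≈ 0.747, which is consistent with the 74% F1-score reported for the full dataset.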

Bibliographic details

Source
Supercomputing Frontiers | 2018 | pp. 179-198 | 20 pages
Conference location: Singapore (SG)
Authors
Jian Guo; Akihiro Nomura; Ryan Barton; Haoyu Zhang; Satoshi Matsuoka
Author affiliations

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan;

Department of Mathematical and Computing Science, Tokyo Institute of Technology, Tokyo, Japan; Global Scientific Information and Computing Center, Tokyo Institute of Technology, Tokyo, Japan; Real World Big-Data Computing Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan;

Original format: PDF
Language: English (eng)
Keywords
HPC; Job log analysis; Underestimation on job runtime; Machine learning
