IEEE International Conference on Distributed Computing Systems
Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations

Abstract

In large-scale computing platforms, jobs are prone to interruptions and premature terminations, limiting their usability and leading to significant waste of cluster resources. In this paper, we tackle this problem in three steps. First, we provide a comprehensive study based on log data from multiple large-scale production systems to identify patterns in the behaviour of unsuccessful jobs across different clusters and to investigate possible root causes of job termination. Our results reveal several interesting properties that distinguish unsuccessful jobs from others, particularly with respect to resource consumption patterns and job configuration settings. Second, we design a machine learning-based framework for predicting job and task terminations. We show that job failures can be predicted relatively early with high precision and recall, and we identify attributes that have strong predictive power for job failure. Finally, we demonstrate in a concrete use case how our prediction framework can be used to mitigate the effect of unsuccessful execution using an effective task-cloning policy that we propose.
