Conference: International Conference on High Performance Computing

Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning

Abstract

Modern distributed computing frameworks for cloud computing and high performance computing typically accelerate jobs by dividing a large job into small tasks that execute in parallel. Some tasks, however, may run far behind others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system that automatically identifies and diagnoses stragglers in jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method that groups tasks based on their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer stragglers from their resource assignment and usage data. Zeno is evaluated on traces from Google's Borg system and Alibaba's Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules with both valuable insights and decent performance in predicting stragglers.
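The two-stage idea in the abstract (cluster tasks by execution time, then learn readable rules from resource data) can be illustrated with a minimal Python sketch. The abstract does not name the specific clustering or rule-learning algorithms, so everything here is an assumption chosen for brevity: a 1-D two-means clustering stands in for the unsupervised step, a single-threshold rule search stands in for the rule learner, and the feature names (cpu_util, mem_util) and task values are invented for illustration. It is not Zeno's implementation.

```python
from statistics import mean


def label_stragglers(durations, iters=50):
    """Stand-in for the unsupervised step: 1-D two-means clustering.

    Returns True for tasks that fall in the slower cluster.
    """
    lo, hi = min(durations), max(durations)
    if lo == hi:  # all tasks took the same time: no stragglers
        return [False] * len(durations)
    for _ in range(iters):
        cut = (lo + hi) / 2
        fast = [d for d in durations if d <= cut]
        slow = [d for d in durations if d > cut]
        new_lo, new_hi = mean(fast), mean(slow)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    cut = (lo + hi) / 2
    return [d > cut for d in durations]


def learn_rule(rows, labels):
    """Stand-in for the supervised step: best single rule `feature > t`.

    Tries midpoints between consecutive distinct feature values and
    returns (feature, threshold, accuracy) for the most accurate rule.
    """
    best = None
    for name in rows[0]:
        values = sorted({r[name] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            hits = sum((r[name] > t) == y for r, y in zip(rows, labels))
            acc = hits / len(rows)
            if best is None or acc > best[2]:
                best = (name, t, acc)
    return best


# Hypothetical task records; real traces would carry many more fields.
tasks = [
    {"duration": 10, "cpu_util": 0.30, "mem_util": 0.40},
    {"duration": 11, "cpu_util": 0.35, "mem_util": 0.70},
    {"duration": 12, "cpu_util": 0.25, "mem_util": 0.50},
    {"duration": 9, "cpu_util": 0.40, "mem_util": 0.45},
    {"duration": 40, "cpu_util": 0.90, "mem_util": 0.55},
    {"duration": 45, "cpu_util": 0.95, "mem_util": 0.60},
]

labels = label_stragglers([t["duration"] for t in tasks])
features = [{k: v for k, v in t.items() if k != "duration"} for t in tasks]
rule = learn_rule(features, labels)
print(f"straggler if {rule[0]} > {rule[1]:.2f} (accuracy {rule[2]:.2f})")
```

On this toy data the two slowest tasks form the straggler cluster, and the learned rule is a single threshold on one resource feature, which is the kind of short, human-readable diagnosis the abstract describes.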