International Conference on Advance Informatics: Concepts, Theory and Applications

Dynamic Resource Scheduler for Distributed Deep Learning Training in Kubernetes


Abstract

Distributed deep learning is a machine learning method that is used today because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a container orchestrator that eases the deployment of applications, which in turn makes distributed deep learning training in Kubeflow easier and more manageable. Dynamic resource schedulers for deep learning training in Kubernetes have been studied before: DRAGON proposed a scheduler with autoscaling and gang scheduling capabilities, and OASIS proposed a utility system with a price function. In this work, we propose combining the approaches of DRAGON and OASIS to build a scheduler with weighted autoscaling that schedules its jobs with gang scheduling. We modify DRAGON's autoscaling function, increasing the frequency of scale-up calls and reducing the frequency of scale-down calls to make the training process more efficient. Weights determine the priority of each job, and jobs with higher resource requirements are considered more important. The weight of each job influences the scheduler's autoscaling function. Experiments and evaluation using a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler.
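
The abstract gives no implementation details, so the following is only a minimal Python sketch of how gang scheduling and weight-ordered scale-up could interact. It is not the authors' scheduler, nor DRAGON's or OASIS's algorithm. The `Job` fields, the weight formula (maximum workers times GPUs per worker), and the `schedule` function are all hypothetical names and choices made for illustration. Scale-down and the call-frequency changes mentioned in the abstract are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical training job description (not from the paper)."""
    name: str
    min_workers: int      # gang size: all of these must start together
    max_workers: int      # upper bound when scaling up
    gpus_per_worker: int
    workers: int = 0      # workers currently allocated

    @property
    def weight(self) -> int:
        # Assumption: weight grows with the job's total resource requirement.
        return self.max_workers * self.gpus_per_worker


def schedule(jobs, free_gpus):
    """Gang-admit jobs by descending weight, then scale up with spare GPUs.

    Returns the number of GPUs still free afterwards.
    """
    ordered = sorted(jobs, key=lambda j: j.weight, reverse=True)

    # Gang scheduling: a job starts only if its whole minimum gang fits.
    for job in ordered:
        need = job.min_workers * job.gpus_per_worker
        if job.workers == 0 and need <= free_gpus:
            job.workers = job.min_workers
            free_gpus -= need

    # Weighted scale-up: heavier running jobs are offered spare GPUs first.
    for job in ordered:
        if job.workers == 0:
            continue  # never admitted, so nothing to scale
        extra = min(job.max_workers - job.workers,
                    free_gpus // job.gpus_per_worker)
        job.workers += extra
        free_gpus -= extra * job.gpus_per_worker

    return free_gpus


if __name__ == "__main__":
    jobs = [
        Job("a", min_workers=2, max_workers=4, gpus_per_worker=1),
        Job("b", min_workers=2, max_workers=8, gpus_per_worker=1),
        Job("c", min_workers=3, max_workers=3, gpus_per_worker=2),
    ]
    left = schedule(jobs, free_gpus=12)
    for job in jobs:
        print(job.name, job.weight, job.workers)
    print("free GPUs:", left)
```

In this sketch a job that cannot place its full minimum gang receives no workers at all, which is the all-or-nothing property gang scheduling is meant to provide. Leftover capacity then goes to the highest-weight running job first.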
