Dynamic Resource Scheduler for Distributed Deep Learning Training in Kubernetes

Abstract

Distributed deep learning is a machine learning method that is widely used today because of its many advantages. One of the many tools used to train distributed deep learning models is Kubeflow, which runs on top of Kubernetes. Kubernetes is a containerized-application orchestrator that eases the deployment of applications, which in turn makes distributed deep learning training in Kubeflow easier and more manageable. Dynamic resource schedulers for deep learning training in Kubernetes have been proposed before, such as DRAGON, which proposed a scheduler with autoscaling and gang scheduling capabilities, and OASIS, which proposed a utility system with a price function. In this work, we propose combining the approaches of DRAGON and OASIS to build a scheduler with weighted autoscaling capabilities that schedules its jobs with gang scheduling. We modify DRAGON's autoscaling function: we try to increase the frequency of scale-up function calls and reduce the frequency of scale-down function calls to make the training process more efficient. Weights determine the priority of each job, where jobs with higher resource requirements are considered more important. The weight of each job influences the autoscaling function of the scheduler. Experiments and evaluation using a set of TensorFlow jobs show an increase in training speed of over 26% compared with the default Kubernetes scheduler.
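The abstract names two ideas, gang scheduling and resource-based job weights, without giving formulas. The following is only a minimal sketch of how they could fit together, not the authors' implementation. The `Job` class, the `job_weights` formula (a job's share of the total requested GPUs), and the greedy admission order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int           # number of workers that must start together
    gpus_per_worker: int

    @property
    def demand(self) -> int:
        # Total GPUs the job needs before any worker can start.
        return self.workers * self.gpus_per_worker


def job_weights(jobs):
    """Hypothetical weight: each job's share of the total requested GPUs."""
    total = sum(j.demand for j in jobs)
    if total == 0:
        return {j.name: 0.0 for j in jobs}
    return {j.name: j.demand / total for j in jobs}


def gang_schedule(jobs, free_gpus):
    """Admit jobs in descending weight order, all-or-nothing per job.

    A job is admitted only if every one of its workers fits at once
    (gang scheduling); otherwise it waits and holds no resources.
    """
    weights = job_weights(jobs)
    admitted, waiting = [], []
    for job in sorted(jobs, key=lambda j: weights[j.name], reverse=True):
        if job.demand <= free_gpus:
            free_gpus -= job.demand
            admitted.append(job.name)
        else:
            waiting.append(job.name)
    return admitted, waiting, free_gpus
```

For example, with 6 free GPUs and jobs requesting 4, 2, and 3 GPUs, `gang_schedule([Job("a", 4, 1), Job("b", 2, 1), Job("c", 3, 1)], 6)` admits `a` first (largest weight), leaves `c` waiting because its 3 workers cannot all start, and then admits `b` in the remaining space. How the paper's scheduler uses weights inside the autoscaling function is not specified in the abstract and is not modeled here.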

Bibliographic details

Source
International Conference on Advance Informatics: Concepts, Theory and Applications | 2020 | pp. 1-6 | 6 pages
Authors
Muhammad Fadhriga Bestari; Achmad Imam Kistijantoro; Anggrahita Bayu Sasmita
Original format
PDF
Keywords
Training; Deep learning; Schedules; Estimation; Tools; Dynamic scheduling; Informatics;

Similar literature

1. Data-driven dynamic resource scheduling for network slicing: A Deep reinforcement learning approach [J]. Wang Haozhe, Wu Yulei, Min Geyong. Information Sciences: An International Journal. 2019.
2. Deep-learning-based power distribution network switch action identification leveraging dynamic features of distributed energy resources [J]. Duan Nan, Stewart Emma M. Generation, Transmission & Distribution, IET. 2019, No. 14.
3. Evaluating a range of learning schedules: hybrid training schedules may be as good as or better than distributed practice for some tasks (vol 59, pg 276, 2016) [J]. Paik J., Ritter F. E. Ergonomics. 2016, No. 7.
4. DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster [C]. Chan-Yi Lin, Ting-An Yeh, Jerry Chou. International Conference on Cloud Computing and Services Science. 2019.
5. Learning aided system performance modeling in support of self-optimized resource scheduling in distributed environments [D]. Zhang, Jian. 2007.
6. Distributed Learning: Revitalizing Anesthesiology Training in Resource-Limited Ethiopia [O]. Krupa B. Patel, Morgan Dooley, Ananya Abate. 2017.
7. Editorial: Resource management in parallel and distributed systems with dynamic scheduling: Dynamic scheduling [O]. Ishfaq Ahmad. 1995.
