International Conference on Cloud Computing and Services Science

DRAGON: A Dynamic Scheduling and Scaling Controller for Managing Distributed Deep Learning Jobs in Kubernetes Cluster

Abstract

With the rapid growth of deep-learning-driven AI services over the past decade, deep learning, especially resource-intensive and time-consuming training jobs, has become one of the main workloads in today's production clusters. However, because of the complex workload characteristics of deep learning and the dynamic nature of shared resource environments, managing the resource allocation and execution lifecycle of distributed training jobs in a cluster can be challenging. This work aims to address these issues by developing and implementing a scheduling and scaling controller that dynamically manages distributed training jobs on a Kubernetes (K8S) cluster, a broadly used platform for managing containerized workloads and services. The objective of our proposed approach is to enhance K8S with three capabilities: (1) task-dependency-aware gang scheduling to avoid idle resources; (2) locality-aware task placement to minimize communication overhead; and (3) load-aware job scaling to improve cost efficiency. Our approach is evaluated on a real testbed and in a simulator using a set of TensorFlow jobs. Compared to the default K8S scheduler, our approach improved resource utilization by 20% to 30% and reduced job elapsed time by over 65%.
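
To make the first two capabilities concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a simplified model in which each node exposes a count of free GPUs and every task of a job needs the same number of GPUs. The function name `gang_place`, its parameters, and the "prefer nodes already used by this job" heuristic are hypothetical and chosen only for illustration. The sketch admits a job only if all of its tasks fit at once (gang scheduling) and tries to co-locate tasks on the same node (locality).

```python
from typing import Dict, List, Optional


def gang_place(free_gpus: Dict[str, int], num_tasks: int,
               gpus_per_task: int) -> Optional[List[str]]:
    """Return one node name per task, or None if the whole job cannot fit.

    All-or-nothing: free_gpus is only updated when every task of the job
    has been placed, so no task holds resources while waiting for peers.
    """
    trial = dict(free_gpus)  # work on a copy; commit only on full success
    placement: List[str] = []
    for _ in range(num_tasks):
        candidates = [n for n, free in trial.items() if free >= gpus_per_task]
        if not candidates:
            return None  # reject the whole gang; nothing was committed
        # Hypothetical locality heuristic: prefer a node that already hosts
        # a task of this job, then the node with the most free GPUs.
        node = max(candidates, key=lambda n: (n in placement, trial[n], n))
        trial[node] -= gpus_per_task
        placement.append(node)
    free_gpus.update(trial)  # commit the placement
    return placement
```

For example, with `{"a": 2, "b": 1}` free GPUs, a three-task job needing one GPU per task is placed in full, while the same job on `{"a": 1, "b": 1}` is rejected and leaves the free-GPU map untouched.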
