
Job Placement Strategy with Opportunistic Resource Sharing for Distributed Deep Learning Clusters



Abstract

Distributed deep learning frameworks train large deep learning workloads with multiple training jobs on shared distributed GPU servers. Scheduling resources for these systems raises new challenges. Modern deep learning training jobs tend to consume large amounts of GPU memory. A training job has an iterative nature that causes its memory usage to fluctuate over time. Jobs sharing a host may suffer significant performance degradation caused by memory overload at runtime. Moreover, even without memory overloads, deep learning training jobs still experience different levels of performance interference when sharing a GPU device. This paper studies these two issues. We introduce an opportunistic memory sharing model to allocate resources for training jobs with time-varying memory requirements. Based on this model, we introduce an opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks job placement configurations that use the minimum number of GPU devices while guaranteeing user-defined performance requirements. We propose a greedy algorithm and a heuristic algorithm, with computational complexities of $O(n\log n)$ and $O(n^{2}\log n)$ respectively, to solve the problem. Extensive experiments are conducted on a GPU cluster to verify the correctness, effectiveness, and scalability of our approaches. The proposed approach achieves over 80% of the standalone performance, in terms of average job completion time, with less than 30% extra resource consumption.
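The paper's own greedy algorithm is not given in the abstract, but the flavor of the placement problem (pack jobs onto the minimum number of GPU devices subject to a memory constraint) can be illustrated with a classic first-fit-decreasing greedy sketch. This is an illustrative assumption, not the authors' algorithm: it packs by a single peak-memory figure per job, whereas the paper's opportunistic model exploits time-varying memory profiles. The `GPU_MEMORY_GB` capacity and the example jobs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

GPU_MEMORY_GB = 16.0  # assumed per-device memory capacity (hypothetical)

@dataclass
class Job:
    name: str
    peak_mem_gb: float  # worst-case memory over the job's iterative training loop

@dataclass
class Gpu:
    free_gb: float = GPU_MEMORY_GB
    jobs: List[str] = field(default_factory=list)

def place_jobs(jobs: List[Job]) -> List[Gpu]:
    """First-fit-decreasing sketch: sort jobs by peak memory (descending),
    place each on the first GPU with enough headroom, else open a new GPU."""
    gpus: List[Gpu] = []
    for job in sorted(jobs, key=lambda j: j.peak_mem_gb, reverse=True):
        for gpu in gpus:
            if gpu.free_gb >= job.peak_mem_gb:
                gpu.free_gb -= job.peak_mem_gb
                gpu.jobs.append(job.name)
                break
        else:  # no existing GPU fits: allocate a new device
            gpus.append(Gpu(free_gb=GPU_MEMORY_GB - job.peak_mem_gb,
                            jobs=[job.name]))
    return gpus

# Hypothetical workload: four jobs with different peak memory footprints.
placement = place_jobs([Job("resnet", 9.0), Job("bert", 6.0),
                        Job("lstm", 4.0), Job("mlp", 3.0)])
print(len(placement))  # number of GPU devices used → 2
```

The opportunistic model in the paper goes further: because the memory usage of a training job fluctuates over its iterations, two jobs whose peaks do not coincide can co-locate even when the sum of their peaks exceeds device capacity, which is what allows fewer devices than a peak-based packing would need.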
