
Job Placement Strategy with Opportunistic Resource Sharing for Distributed Deep Learning Clusters



Abstract

Distributed deep learning frameworks train large deep learning workloads with multiple training jobs on shared distributed GPU servers. Scheduling resources for these systems raises new challenges. Modern deep learning training jobs tend to consume large amounts of GPU memory. A training job has an iterative nature that causes its memory usage to fluctuate over time. Jobs sharing a host may suffer significant performance degradation caused by memory overload at runtime. Moreover, even without memory overloads, deep learning training jobs still experience different levels of performance interference when sharing a GPU device. This paper studies these two issues. We introduce an opportunistic memory sharing model to allocate resources for training jobs with time-varying memory requirements. Based on this model, we introduce an opportunistic Job Placement Problem (OJPP) for shared GPU clusters, which seeks job placement configurations that use the minimum number of GPU devices while guaranteeing user-defined performance requirements. We propose a greedy algorithm and a heuristic algorithm, with computational complexities of $O(n\log n)$ and $O(n^{2}\log n)$ respectively, to solve the problem. Extensive experiments are conducted on a GPU cluster to verify the correctness, effectiveness, and scalability of our approaches. The proposed approach achieves over 80% of the standalone performance, in terms of average job completion time, with less than 30% extra resource consumption.
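The paper's own greedy algorithm is not given in the abstract, but the flavor of the placement problem (pack jobs onto the minimum number of GPU devices subject to a memory constraint) can be illustrated with a classic first-fit-decreasing greedy sketch. This is an illustrative assumption, not the authors' algorithm: it packs by a single peak-memory figure per job, whereas the paper's opportunistic model exploits time-varying memory profiles. The `GPU_MEMORY_GB` capacity and the example jobs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

GPU_MEMORY_GB = 16.0  # assumed per-device memory capacity (hypothetical)

@dataclass
class Job:
    name: str
    peak_mem_gb: float  # worst-case memory over the job's iterative training loop

@dataclass
class Gpu:
    free_gb: float = GPU_MEMORY_GB
    jobs: List[str] = field(default_factory=list)

def place_jobs(jobs: List[Job]) -> List[Gpu]:
    """First-fit-decreasing sketch: sort jobs by peak memory (descending),
    place each on the first GPU with enough headroom, else open a new GPU."""
    gpus: List[Gpu] = []
    for job in sorted(jobs, key=lambda j: j.peak_mem_gb, reverse=True):
        for gpu in gpus:
            if gpu.free_gb >= job.peak_mem_gb:
                gpu.free_gb -= job.peak_mem_gb
                gpu.jobs.append(job.name)
                break
        else:  # no existing GPU fits: allocate a new device
            gpus.append(Gpu(free_gb=GPU_MEMORY_GB - job.peak_mem_gb,
                            jobs=[job.name]))
    return gpus

# Hypothetical workload: four jobs with different peak memory footprints.
placement = place_jobs([Job("resnet", 9.0), Job("bert", 6.0),
                        Job("lstm", 4.0), Job("mlp", 3.0)])
print(len(placement))  # number of GPU devices used → 2
```

The opportunistic model in the paper goes further: because the memory usage of a training job fluctuates over its iterations, two jobs whose peaks do not coincide can co-locate even when the sum of their peaks exceeds device capacity, which is what allows fewer devices than a peak-based packing would need.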
