
Optimizing makespan and resource utilization for multi-DNN training in GPU cluster



Abstract

Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI), gaining great popularity in both industry and academia. Increasing the size of DNN models dramatically improves learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up the training process, many distributed deep learning (DL) systems and frameworks have been designed for parallel DNN training on multiple GPUs. However, most existing studies concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, integrating job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which preserves model accuracy without tuning other hyper-parameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture and a GPU cluster of 8 nodes, each with 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to baseline scheduling algorithms and reaches average resource utilizations of 98.4% and 99.2% on image classification and action recognition, respectively, with state-of-the-art model accuracy.
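
The abstract does not spell out the OMRU scheduler itself, but two of the building blocks it names, the linear scaling rule and Ring-Allreduce training in PyTorch, are standard techniques and can be sketched. The following minimal sketch is illustrative only; the baseline learning rate, batch sizes, and the toy model are assumed values, not taken from the paper. It scales the learning rate linearly with the global minibatch size and wraps a model in DistributedDataParallel with the NCCL backend, which synchronizes gradients via ring allreduce.

    # Illustrative sketch only: LSR learning-rate scaling plus a PyTorch
    # DistributedDataParallel (NCCL ring allreduce) setup. BASE_LR, BASE_BATCH,
    # PER_GPU_BATCH and the toy model are assumptions, not values from the paper.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    BASE_LR = 0.1        # learning rate tuned for the single-GPU baseline (assumed)
    BASE_BATCH = 256     # minibatch size that BASE_LR was tuned for (assumed)
    PER_GPU_BATCH = 256  # per-GPU minibatch size in the distributed run (assumed)

    def linear_scaled_lr(world_size: int) -> float:
        # Linear scaling rule: if the global minibatch grows by a factor k,
        # multiply the learning rate by k.
        global_batch = PER_GPU_BATCH * world_size
        return BASE_LR * global_batch / BASE_BATCH

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(512, 10).cuda(local_rank)  # toy stand-in for a DNN
        model = DDP(model, device_ids=[local_rank])        # gradients averaged via ring allreduce

        lr = linear_scaled_lr(dist.get_world_size())
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        # ... training loop over the assigned minibatches would go here ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun on each node (with the appropriate rendezvous settings for a multi-node run), every process trains one shard of the global minibatch while NCCL performs the ring allreduce; the single learning-rate adjustment is the only hyper-parameter change the LSR requires.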

Bibliographic information

  • Source
    Future Generation Computer Systems | 2021, Issue 12 | pp. 206-220 | 15 pages
  • Author affiliations

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    Artificial Intelligence and Information Systems Research Group, School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK;

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    State Key Laboratory for Novel Software Technology, Software Institute, Nanjing University, Nanjing, China;

    Department of Mathematics and Applications 'R. Caccioppoli' (DMA), University of Naples Federico II (UNINA), Italy;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • CLC classification
  • Keywords

    Deep neural network (DNN) training; Ring-Allreduce; Job scheduling; Resource allocation; Linear scaling rule (LSR); GPU cluster;

  • Date added: 2022-08-19 02:30:25
