
Optimizing makespan and resource utilization for multi-DNN training in GPU cluster



Abstract

Deep neural networks (DNNs) have been widely applied in many fields of artificial intelligence (AI), gaining great popularity in both industry and academia. Increasing the size of DNN models dramatically improves learning accuracy. However, training large-scale DNN models on a single GPU takes an unacceptably long time. To speed up the training process, many distributed deep learning (DL) systems and frameworks have been designed for parallel DNN training on multiple GPUs. However, most existing studies concentrate only on improving the training speed of a single DNN model under centralized or decentralized systems with synchronous or asynchronous approaches. Few works consider multi-DNN training on a GPU cluster, which is a joint optimization problem of job scheduling and resource allocation. This paper proposes an optimizing makespan and resource utilization (OMRU) approach to minimize job completion time and improve resource utilization for multi-DNN training in a GPU cluster. Specifically, we first collect the training speed/time data of all DNN models by running each job for one epoch on different numbers of GPUs. The OMRU algorithm, integrating job scheduling, resource allocation, and GPU reuse strategies, is then devised to minimize the total job completion time (also called makespan) and improve GPU cluster resource utilization. The linear scaling rule (LSR) is adopted to adjust the learning rate when a DNN model is trained on multiple GPUs with a large minibatch size, which preserves model accuracy without tuning other hyper-parameters. We implement the OMRU algorithm on PyTorch with the Ring-Allreduce communication architecture and a GPU cluster of 8 nodes, each with 4 NVIDIA V100 GPUs. Experimental results on image classification and action recognition show that OMRU achieves a makespan reduction of up to 30% compared to baseline scheduling algorithms and reaches average resource utilizations of 98.4% and 99.2% on image classification and action recognition, respectively, with state-of-the-art model accuracy.
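
The abstract does not spell out the OMRU scheduler itself, but two of the building blocks it names, the linear scaling rule and Ring-Allreduce training in PyTorch, are standard techniques and can be sketched. The following minimal sketch is illustrative only; the baseline learning rate, batch sizes, and the toy model are assumed values, not taken from the paper. It scales the learning rate linearly with the global minibatch size and wraps a model in DistributedDataParallel with the NCCL backend, which synchronizes gradients via ring allreduce.

    # Illustrative sketch only: LSR learning-rate scaling plus a PyTorch
    # DistributedDataParallel (NCCL ring allreduce) setup. BASE_LR, BASE_BATCH,
    # PER_GPU_BATCH and the toy model are assumptions, not values from the paper.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    BASE_LR = 0.1        # learning rate tuned for the single-GPU baseline (assumed)
    BASE_BATCH = 256     # minibatch size that BASE_LR was tuned for (assumed)
    PER_GPU_BATCH = 256  # per-GPU minibatch size in the distributed run (assumed)

    def linear_scaled_lr(world_size: int) -> float:
        # Linear scaling rule: if the global minibatch grows by a factor k,
        # multiply the learning rate by k.
        global_batch = PER_GPU_BATCH * world_size
        return BASE_LR * global_batch / BASE_BATCH

    def main():
        # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(512, 10).cuda(local_rank)  # toy stand-in for a DNN
        model = DDP(model, device_ids=[local_rank])        # gradients averaged via ring allreduce

        lr = linear_scaled_lr(dist.get_world_size())
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        # ... training loop over the assigned minibatches would go here ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with torchrun on each node (with the appropriate rendezvous settings for a multi-node run), every process trains one shard of the global minibatch while NCCL performs the ring allreduce; the single learning-rate adjustment is the only hyper-parameter change the LSR requires.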

Bibliographic information

  • Source
    Future Generation Computer Systems | 2021, Issue 12 | pp. 206-220 | 15 pages
  • Author affiliations

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    Artificial Intelligence and Information Systems Research Group, School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK;

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China;

    State Key Laboratory for Novel Software Technology, Software Institute, Nanjing University, Nanjing, China;

    Department of Mathematics and Applications 'R. Caccioppoli' (DMA), University of Naples Federico II (UNINA), Italy;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Format: PDF
  • Language: English
  • CLC classification
  • Keywords

    Deep neural network (DNN) training; Ring-Allreduce; Job scheduling; Resource allocation; Linear scaling rule (LSR); GPU cluster;

  • Date added: 2022-08-19 02:30:25
