Journal of Supercomputing

BOA: batch orchestration algorithm for straggler mitigation of distributed DL training in heterogeneous GPU cluster


Abstract

Training a deep learning model is a time-consuming job since it usually involves a large amount of data. To reduce training time, most practitioners train their models in a distributed fashion on a GPU cluster. Synchronous stochastic gradient descent (SGD), one of the most widely used distributed training algorithms, converges quickly when multiple GPU workers are used, but its speed is bound by the slowest worker, i.e., the straggler. In a heterogeneous environment, a static straggler, which has received little attention so far, degrades performance more than a randomly occurring (dynamic) straggler. However, most existing studies on straggler mitigation assume a homogeneous environment, so their approaches are of limited use in practice. In this paper, we scrutinize the straggler problem in a heterogeneous environment and, from empirical results, define static and dynamic stragglers. Based on this, we propose a novel approach called the batch orchestration algorithm (BOA) for straggler mitigation. It adaptively balances the mini-batch size assigned to each worker according to that worker's speed; thus BOA can mitigate both static and dynamic stragglers in a modern GPU cluster. BOA finds the optimal mini-batch sizes by solving a min-max integer program built on hardware-agnostic performance models. For verification, several experiments are conducted on a cluster with up to six GPUs of three types: GTX 1080, GTX 1060, and Quadro M2000. The results show that BOA mitigates both types of stragglers and accelerates synchronous SGD training compared to another straggler mitigation method.
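To illustrate the core idea of speed-proportional batch orchestration, the following is a minimal sketch (not the paper's actual min-max integer program) that splits a global mini-batch across heterogeneous workers so that each worker's per-iteration compute time is roughly equalized. The function name, the use of measured per-sample times as the performance model, and the greedy rounding of leftover samples are all illustrative assumptions.

```python
def orchestrate_batches(per_sample_time, global_batch):
    """Split `global_batch` samples across workers so the slowest worker's
    iteration time is (approximately) minimized.

    per_sample_time: measured seconds per sample for each worker
                     (a stand-in for a hardware-agnostic performance model).
    Returns: list of per-worker mini-batch sizes summing to `global_batch`.
    """
    # Allocate proportionally to speed: b_i ∝ 1 / t_i equalizes b_i * t_i.
    speeds = [1.0 / t for t in per_sample_time]
    total_speed = sum(speeds)
    alloc = [int(global_batch * s / total_speed) for s in speeds]

    # Greedily hand leftover samples (lost to rounding) to the worker
    # whose iteration time would grow the least.
    leftover = global_batch - sum(alloc)
    for _ in range(leftover):
        i = min(range(len(alloc)),
                key=lambda j: (alloc[j] + 1) * per_sample_time[j])
        alloc[i] += 1
    return alloc


# Example: one fast, one medium, one slow worker (e.g. heterogeneous GPUs).
sizes = orchestrate_batches([1.0, 2.0, 4.0], global_batch=14)
print(sizes)  # [8, 4, 2] -> every worker takes ~8 time units per iteration
```

With equal per-worker times, no worker idles waiting for a straggler at the synchronization barrier; re-measuring `per_sample_time` each iteration would extend this static scheme to dynamic stragglers.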

