Concurrency and Computation: Practice and Experience

Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters



Abstract

Training models on large-scale GPU-accelerated clusters is becoming commonplace due to the increase in the complexity and size of deep learning models. One of the main challenges for distributed training is the collective communication overhead for large message sizes: up to hundreds of MB. In this paper, we propose two hierarchical distributed-memory multileader AllReduce algorithms optimized for GPU-accelerated clusters (named lr_lr and lr_rab), in which the GPUs inside a computing node perform an intra-node communication phase that gathers and stores the locally reduced values on designated GPUs (known as node leaders). The node leaders then act as inter-node communicators: each leader exchanges one part of the reduced values with the leaders of the other nodes in parallel. Hence, we significantly reduce the time for injecting data into the inter-node network. We also overlap the inter-node and intra-node communication by implementing our proposal in a pipelined manner. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an AllReduce microbenchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. With the pipelined implementation, our lr_lr_pipe achieves a 15% performance improvement over lr_lr. In addition, the simulation results also project power savings of up to 23% and 32% for the network devices.
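The code below is a minimal sketch, in plain MPI C, of the general multileader idea described in the abstract: an intra-node reduce-scatter assigns one segment of the buffer to each GPU (so every GPU acts as the node leader for its segment), the leaders then reduce their segments across nodes in parallel, and an intra-node allgather reassembles the full result on every GPU. It is not the authors' lr_lr or lr_rab implementation (which use logical-ring exchanges and pipelining); the one-MPI-rank-per-GPU mapping, host-side float buffers, and an element count divisible by the number of ranks per node are all assumptions made here for illustration.

/* Sketch of a hierarchical multileader AllReduce (not the paper's lr_lr/lr_rab).
   Phase 1: intra-node reduce-scatter leaves one reduced segment per local rank.
   Phase 2: ranks with the same local index (the leaders for that segment)
            reduce their segment across nodes, all segments in parallel.
   Phase 3: intra-node allgather rebuilds the full reduced buffer everywhere. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allreduce(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Comm node_comm, leader_comm;
    int local_rank, local_size;

    /* Group the ranks that share a node (one MPI rank per GPU is assumed). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    /* Ranks with the same local index form one inter-node leader group. */
    MPI_Comm_split(MPI_COMM_WORLD, local_rank, 0, &leader_comm);

    int seg = count / local_size;              /* assumes count % local_size == 0 */
    float *segment = (float *)malloc((size_t)seg * sizeof(float));

    /* Phase 1: intra-node reduce-scatter to the per-segment leaders. */
    MPI_Reduce_scatter_block(sendbuf, segment, seg, MPI_FLOAT, MPI_SUM, node_comm);

    /* Phase 2: each leader reduces its own segment across the nodes. */
    MPI_Allreduce(MPI_IN_PLACE, segment, seg, MPI_FLOAT, MPI_SUM, leader_comm);

    /* Phase 3: intra-node allgather of the globally reduced segments. */
    MPI_Allgather(segment, seg, MPI_FLOAT, recvbuf, seg, MPI_FLOAT, node_comm);

    free(segment);
    MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

With this segment-per-leader decomposition, the inter-node exchanges of all segments proceed concurrently, which is how the multileader schemes shorten the time needed to inject data into the inter-node network.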

