International Symposium on Computing and Networking Workshops

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Abstract

Driven by the increase in the complexity and size of Deep Learning models, training models on large-scale GPU-accelerated clusters is becoming commonplace. One of the main challenges for distributed training is the collective communication overhead for very large message sizes, ranging from several to hundreds of megabytes. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab. In these algorithms, each node performs inter-node data transfers in parallel through multiple GPUs that are designated as node leaders. Each leader keeps and exchanges only a partial result of the locally reduced values rather than the whole buffer. Hence, the time for injecting data into the inter-node network is significantly reduced. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce micro-benchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, savings in the power consumption of network devices of up to 23% and 32% are projected.
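The abstract outlines the multi-leader idea: GPUs within a node first reduce their data locally, several of them are designated as leaders, and each leader exchanges only its own slice of the reduced buffer with the corresponding leaders on other nodes. The sketch below is a rough illustration of that structure using plain MPI, not the authors' implementation: the communicator layout (one MPI process per GPU), the LEADERS_PER_NODE constant, and the use of a simple MPI_Allreduce for the inter-node phase (instead of the paper's logical-ring or Rabenseifner-style exchange) are all simplifying assumptions.

```c
/*
 * Minimal sketch of a hierarchical multi-leader allreduce.
 * Assumptions (not from the paper): one MPI process per GPU,
 * LEADERS_PER_NODE leader processes per node, count divisible by
 * LEADERS_PER_NODE, and an intra-node allreduce standing in for the
 * node-local reduction step.
 */
#include <mpi.h>

#define LEADERS_PER_NODE 4   /* assumed number of leader GPUs per node */

void multi_leader_allreduce(float *buf, int count, MPI_Comm comm)
{
    int world_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* 1. Build an intra-node communicator (one process per GPU). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* 2. Intra-node reduction: after this call every process on the node
          holds the node-local sum (a reduce-scatter would also suffice). */
    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, node_comm);

    /* 3. Multi-leader inter-node phase: the first LEADERS_PER_NODE
          processes of each node each own one chunk of the buffer and
          reduce that chunk across nodes, in parallel. Leader i of every
          node joins the same inter-node communicator. */
    int is_leader = (node_rank < LEADERS_PER_NODE);
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, is_leader ? node_rank : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    int chunk = count / LEADERS_PER_NODE;   /* assume even division */
    if (is_leader) {
        MPI_Allreduce(MPI_IN_PLACE, buf + node_rank * chunk, chunk,
                      MPI_FLOAT, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* 4. Intra-node distribution: each leader broadcasts its finished
          chunk to the other GPUs on the same node. */
    for (int l = 0; l < LEADERS_PER_NODE; l++)
        MPI_Bcast(buf + l * chunk, chunk, MPI_FLOAT, l, node_comm);

    MPI_Comm_free(&node_comm);
}
```

Because each leader injects only 1/LEADERS_PER_NODE of the node-local buffer into the inter-node network, the inter-node transfers proceed in parallel over multiple network endpoints, which is the effect the abstract attributes to lr_lr and lr_rab.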