International Symposium on Computing and Networking Workshops

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads



Abstract

Driven by the increase in complexity and size of Deep Learning models, training models on large-scale GPU-accelerated clusters is becoming commonplace. One of the main challenges for distributed training is the collective communication overhead for very large messages, ranging from several to hundreds of MB. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab. In these algorithms, each node performs inter-node data transfers in parallel through multiple GPUs that are designated as node leaders. Each leader keeps and exchanges only a partial result of the locally reduced values rather than the whole buffer. Hence, the time for injecting data into the inter-node network is significantly reduced. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce micro-benchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, savings in the power consumption of network devices of up to 23% and 32% are projected.
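
The abstract only outlines the multi-leader scheme, so the following is a minimal sketch of the general idea, not the authors' lr_lr or lr_rab implementation. It uses mpi4py with built-in MPI collectives standing in for the paper's logical-ring and Rabenseifner phases, and plain host buffers rather than GPU memory; the function name multi_leader_allreduce and the buffer sizes are illustrative assumptions.

# Sketch of a hierarchical multi-leader allreduce (assumptions noted above):
# every rank on a node acts as a "leader" for one chunk of the message,
# so inter-node traffic is injected in parallel rather than through a
# single per-node root.
import numpy as np
from mpi4py import MPI

def multi_leader_allreduce(data, comm=MPI.COMM_WORLD):
    # Intra-node communicator: ranks sharing the same physical node.
    node = comm.Split_type(MPI.COMM_TYPE_SHARED)
    local_rank, local_size = node.Get_rank(), node.Get_size()
    assert data.size % local_size == 0, "pad the buffer in real code"
    chunk_len = data.size // local_size

    # Phase 1: intra-node reduce-scatter -- each local rank (leader)
    # ends up holding the node-wide sum of one chunk.
    chunk = np.empty(chunk_len, dtype=data.dtype)
    node.Reduce_scatter_block(data, chunk, op=MPI.SUM)

    # Phase 2: inter-node allreduce, one communicator per leader index,
    # so all leaders exchange their chunks across nodes in parallel.
    leaders = comm.Split(color=local_rank, key=comm.Get_rank())
    leaders.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)

    # Phase 3: intra-node allgather to rebuild the full reduced buffer.
    result = np.empty_like(data)
    node.Allgather(chunk, result)
    return result

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    buf = np.full(1 << 20, comm.Get_rank(), dtype=np.float32)
    out = multi_leader_allreduce(buf, comm)
    # Every element should equal 0 + 1 + ... + (world_size - 1).
    assert np.allclose(out, comm.Get_size() * (comm.Get_size() - 1) / 2)

Because each leader only carries 1/G of the message (for G leaders per node) across the inter-node network, the per-link injected volume shrinks accordingly, which is the source of the execution-time and network-power reductions reported above.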
