Concurrency and Computation: Practice and Experience

Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters



Abstract

Training models on large-scale GPU-accelerated clusters is becoming commonplace due to the increase in the complexity and size of deep learning models. One of the main challenges for distributed training is the collective communication overhead for large message sizes: up to hundreds of MB. In this paper, we propose two hierarchical distributed-memory multileader AllReduce algorithms optimized for GPU-accelerated clusters (named lr_lr and lr_rab), in which the GPUs inside a computing node perform an intra-node communication phase that gathers and stores the locally reduced values on designated GPUs (known as node leaders). The node leaders then act as inter-node communicators: each leader exchanges one part of the reduced values with the leaders of the other nodes in parallel. Hence, we significantly reduce the time for injecting data into the inter-node network. We also overlap the inter-node and intra-node communication by implementing our proposal in a pipelined manner. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an AllReduce microbenchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. With the pipelined implementation, our lr_lr_pipe achieves a 15% performance improvement over lr_lr. In addition, the simulation results also project power savings of up to 23% and 32% for the network devices.
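The code below is a minimal sketch, in plain MPI C, of the general multileader idea described in the abstract: an intra-node reduce-scatter assigns one segment of the buffer to each GPU (so every GPU acts as the node leader for its segment), the leaders then reduce their segments across nodes in parallel, and an intra-node allgather reassembles the full result on every GPU. It is not the authors' lr_lr or lr_rab implementation (which use logical-ring exchanges and pipelining); the one-MPI-rank-per-GPU mapping, host-side float buffers, and an element count divisible by the number of ranks per node are all assumptions made here for illustration.

/* Sketch of a hierarchical multileader AllReduce (not the paper's lr_lr/lr_rab).
   Phase 1: intra-node reduce-scatter leaves one reduced segment per local rank.
   Phase 2: ranks with the same local index (the leaders for that segment)
            reduce their segment across nodes, all segments in parallel.
   Phase 3: intra-node allgather rebuilds the full reduced buffer everywhere. */
#include <mpi.h>
#include <stdlib.h>

void hierarchical_allreduce(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Comm node_comm, leader_comm;
    int local_rank, local_size;

    /* Group the ranks that share a node (one MPI rank per GPU is assumed). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    /* Ranks with the same local index form one inter-node leader group. */
    MPI_Comm_split(MPI_COMM_WORLD, local_rank, 0, &leader_comm);

    int seg = count / local_size;              /* assumes count % local_size == 0 */
    float *segment = (float *)malloc((size_t)seg * sizeof(float));

    /* Phase 1: intra-node reduce-scatter to the per-segment leaders. */
    MPI_Reduce_scatter_block(sendbuf, segment, seg, MPI_FLOAT, MPI_SUM, node_comm);

    /* Phase 2: each leader reduces its own segment across the nodes. */
    MPI_Allreduce(MPI_IN_PLACE, segment, seg, MPI_FLOAT, MPI_SUM, leader_comm);

    /* Phase 3: intra-node allgather of the globally reduced segments. */
    MPI_Allgather(segment, seg, MPI_FLOAT, recvbuf, seg, MPI_FLOAT, node_comm);

    free(segment);
    MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

With this segment-per-leader decomposition, the inter-node exchanges of all segments proceed concurrently, which is how the multileader schemes shorten the time needed to inject data into the inter-node network.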

