International Symposium on Computing and Networking Workshops

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Abstract

Driven by the increase in the complexity and size of Deep Learning models, training models on large-scale GPU-accelerated clusters is becoming commonplace. One of the main challenges for distributed training is the collective communication overhead for very large message sizes, ranging from several to hundreds of megabytes. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab. In these algorithms, each node performs inter-node data transfers in parallel through multiple GPUs that are designated as node leaders. Each leader keeps and exchanges only a partial result of the locally reduced values rather than the whole buffer. Hence, the time for injecting data into the inter-node network is significantly reduced. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce micro-benchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, savings in the power consumption of network devices of up to 23% and 32% are projected.
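The abstract outlines the multi-leader idea: GPUs within a node first reduce their data locally, several of them are designated as leaders, and each leader exchanges only its own slice of the reduced buffer with the corresponding leaders on other nodes. The sketch below is a rough illustration of that structure using plain MPI, not the authors' implementation: the communicator layout (one MPI process per GPU), the LEADERS_PER_NODE constant, and the use of a simple MPI_Allreduce for the inter-node phase (instead of the paper's logical-ring or Rabenseifner-style exchange) are all simplifying assumptions.

```c
/*
 * Minimal sketch of a hierarchical multi-leader allreduce.
 * Assumptions (not from the paper): one MPI process per GPU,
 * LEADERS_PER_NODE leader processes per node, count divisible by
 * LEADERS_PER_NODE, and an intra-node allreduce standing in for the
 * node-local reduction step.
 */
#include <mpi.h>

#define LEADERS_PER_NODE 4   /* assumed number of leader GPUs per node */

void multi_leader_allreduce(float *buf, int count, MPI_Comm comm)
{
    int world_rank;
    MPI_Comm_rank(comm, &world_rank);

    /* 1. Build an intra-node communicator (one process per GPU). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* 2. Intra-node reduction: after this call every process on the node
          holds the node-local sum (a reduce-scatter would also suffice). */
    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, node_comm);

    /* 3. Multi-leader inter-node phase: the first LEADERS_PER_NODE
          processes of each node each own one chunk of the buffer and
          reduce that chunk across nodes, in parallel. Leader i of every
          node joins the same inter-node communicator. */
    int is_leader = (node_rank < LEADERS_PER_NODE);
    MPI_Comm leader_comm;
    MPI_Comm_split(comm, is_leader ? node_rank : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    int chunk = count / LEADERS_PER_NODE;   /* assume even division */
    if (is_leader) {
        MPI_Allreduce(MPI_IN_PLACE, buf + node_rank * chunk, chunk,
                      MPI_FLOAT, MPI_SUM, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* 4. Intra-node distribution: each leader broadcasts its finished
          chunk to the other GPUs on the same node. */
    for (int l = 0; l < LEADERS_PER_NODE; l++)
        MPI_Bcast(buf + l * chunk, chunk, MPI_FLOAT, l, node_comm);

    MPI_Comm_free(&node_comm);
}
```

Because each leader injects only 1/LEADERS_PER_NODE of the node-local buffer into the inter-node network, the inter-node transfers proceed in parallel over multiple network endpoints, which is the effect the abstract attributes to lr_lr and lr_rab.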