International Symposium on Computing and Networking Workshops

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads



Abstract

Driven by the increase in complexity and size of Deep Learning models, training models on large-scale GPU-accelerated clusters is becoming commonplace. One of the main challenges for distributed training is the collective communication overhead for very large messages, ranging from several to hundreds of MB. In this paper, we exploit two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab. In these algorithms, each node performs inter-node data transfers in parallel through multiple GPUs that are designated as node leaders. Each leader keeps and exchanges only a partial result of the locally reduced values rather than the whole buffer. Hence, the time for injecting data into the inter-node network is significantly reduced. We evaluate these algorithms with the discrete-event simulator SimGrid. We show that our algorithms, lr_lr and lr_rab, can cut down the execution time of an Allreduce micro-benchmark that uses the logical ring algorithm (lr) by up to 45% and 51%, respectively. In addition, savings in the power consumption of network devices of up to 23% and 32% are projected.
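
The abstract only outlines the multi-leader scheme, so the following is a minimal sketch of the general idea, not the authors' lr_lr or lr_rab implementation. It uses mpi4py with built-in MPI collectives standing in for the paper's logical-ring and Rabenseifner phases, and plain host buffers rather than GPU memory; the function name multi_leader_allreduce and the buffer sizes are illustrative assumptions.

# Sketch of a hierarchical multi-leader allreduce (assumptions noted above):
# every rank on a node acts as a "leader" for one chunk of the message,
# so inter-node traffic is injected in parallel rather than through a
# single per-node root.
import numpy as np
from mpi4py import MPI

def multi_leader_allreduce(data, comm=MPI.COMM_WORLD):
    # Intra-node communicator: ranks sharing the same physical node.
    node = comm.Split_type(MPI.COMM_TYPE_SHARED)
    local_rank, local_size = node.Get_rank(), node.Get_size()
    assert data.size % local_size == 0, "pad the buffer in real code"
    chunk_len = data.size // local_size

    # Phase 1: intra-node reduce-scatter -- each local rank (leader)
    # ends up holding the node-wide sum of one chunk.
    chunk = np.empty(chunk_len, dtype=data.dtype)
    node.Reduce_scatter_block(data, chunk, op=MPI.SUM)

    # Phase 2: inter-node allreduce, one communicator per leader index,
    # so all leaders exchange their chunks across nodes in parallel.
    leaders = comm.Split(color=local_rank, key=comm.Get_rank())
    leaders.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)

    # Phase 3: intra-node allgather to rebuild the full reduced buffer.
    result = np.empty_like(data)
    node.Allgather(chunk, result)
    return result

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    buf = np.full(1 << 20, comm.Get_rank(), dtype=np.float32)
    out = multi_leader_allreduce(buf, comm)
    # Every element should equal 0 + 1 + ... + (world_size - 1).
    assert np.allclose(out, comm.Get_size() * (comm.Get_size() - 1) / 2)

Because each leader only carries 1/G of the message (for G leaders per node) across the inter-node network, the per-link injected volume shrinks accordingly, which is the source of the execution-time and network-power reductions reported above.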
