Parallel Computing

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?

Abstract

Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with the new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this paper, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) a pipelined chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a topology-aware pipelined chain (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of our designs, we present an in-depth performance landscape for the proposed MPI_Bcast (MPI) designs, our earlier NCCL-based MPI_Bcast (MPI+NCCL1) design, and the ncclBroadcast (NCCL2) design. The proposed designs offer up to 14x and 16.6x better performance than MPI+NCCL1-based solutions for intra- and inter-node broadcast latency, respectively. With the recent introduction of the inter-node-capable NCCL2 library, we have extended our evaluation by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design. We report up to 10x better performance for small and medium message sizes and comparable performance for large message sizes. We also observed that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. Furthermore, we provide an application-level performance comparison using a CUDA-Aware version of CNTK called CA-CNTK. The proposed MPI_Bcast designs provide up to 7% improvement over MPI+NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs. We present our performance evaluation on three GPU clusters with diverse characteristics: (1) KESCH, a dense multi-GPU system with 8 K80 GPU cards per node; (2) RI2, with a single K80 GPU card per node; and (3) Owens, with a single P100 GPU per node. (C) 2019 Elsevier B.V. All rights reserved.
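
To make the comparison above concrete, the following minimal sketch contrasts the two broadcast interfaces discussed in the abstract: a CUDA-aware MPI_Bcast that accepts a GPU (device) pointer directly, and the NCCL2 ncclBroadcast call issued on a CUDA stream. This is an illustrative sketch only, not the paper's PC or TA-PC implementation; the message size, root rank, and rank-to-GPU mapping are arbitrary assumptions, and it presumes a CUDA-aware MPI build so that device pointers can be passed straight to MPI_Bcast.

/* Minimal sketch (assumptions noted in comments): broadcast a large GPU buffer
 * with (1) CUDA-aware MPI and (2) NCCL2. Example build line, assuming MPI,
 * CUDA, and NCCL are installed:
 *   mpicc bcast_sketch.c -lcudart -lnccl -o bcast_sketch
 */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

#define COUNT (64 * 1024 * 1024)   /* 64M floats (256 MB): a "large message" */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simplified rank-to-GPU mapping; assumes up to 8 GPUs per node. */
    cudaSetDevice(rank % 8);

    float *d_buf;
    cudaMalloc((void **)&d_buf, COUNT * sizeof(float));

    /* (1) CUDA-aware MPI: the device pointer is passed directly to MPI_Bcast;
     *     the MPI runtime handles staging/pipelining of the GPU buffer. */
    MPI_Bcast(d_buf, COUNT, MPI_FLOAT, /*root=*/0, MPI_COMM_WORLD);

    /* (2) NCCL2: bootstrap a NCCL communicator over the same ranks
     *     (the unique id is distributed with a small host-side MPI_Bcast),
     *     then broadcast the GPU buffer on a CUDA stream. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, size, id, rank);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    ncclBroadcast(d_buf, d_buf, COUNT, ncclFloat, /*root=*/0, comm, stream);
    cudaStreamSynchronize(stream);

    ncclCommDestroy(comm);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

In data-parallel DL training, broadcasts like these are typically used to distribute model parameters to all GPUs, which is why large-message broadcast latency directly affects training time.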
