Parallel Computing

Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2?

Abstract

Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes need to be designed for such systems. This, coupled with the new application workloads brought forward by Deep Learning (DL) frameworks like Caffe and Microsoft Cognitive Toolkit (CNTK), poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have emerged to deal with DL workloads. In this paper, we address these new challenges for MPI runtimes and propose two new designs to deal with them: (1) a pipelined chain (PC) design for MPI_Bcast that provides efficient intra- and inter-node communication of GPU buffers, and (2) a topology-aware pipelined chain (TA-PC) design for systems with multiple GPUs that fully exploits all the PCIe links available within a multi-GPU node. To highlight the benefits of our designs, we present an in-depth performance landscape for the proposed MPI_Bcast (MPI) designs, our earlier NCCL-based MPI_Bcast (MPI+NCCL1) design, and the ncclBroadcast (NCCL2) design. The proposed designs offer up to 14x and 16.6x better performance than MPI+NCCL1-based solutions for intra- and inter-node broadcast latency, respectively. With the recent introduction of the inter-node-capable NCCL2 library, we have extended our evaluation by adding comparisons between the proposed MPI_Bcast designs and the ncclBroadcast (NCCL2) design. We report up to 10x better performance for small and medium message sizes and comparable performance for large message sizes. We also observed that the TA-PC design is up to 50% better than the PC design for MPI_Bcast to 64 GPUs. Furthermore, we provide an application-level performance comparison using a CUDA-Aware version of CNTK called CA-CNTK. The proposed MPI_Bcast designs provide up to 7% improvement over MPI+NCCL-based solutions for data-parallel training of the VGG network on 128 GPUs. We present our performance evaluation on three GPU clusters with diverse characteristics: (1) KESCH, a dense multi-GPU system with 8 K80 GPU cards per node; (2) RI2, with a single K80 GPU card per node; and (3) Owens, with a single P100 GPU per node. (C) 2019 Elsevier B.V. All rights reserved.
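
To make the comparison above concrete, the following minimal sketch contrasts the two broadcast interfaces discussed in the abstract: a CUDA-aware MPI_Bcast that accepts a GPU (device) pointer directly, and the NCCL2 ncclBroadcast call issued on a CUDA stream. This is an illustrative sketch only, not the paper's PC or TA-PC implementation; the message size, root rank, and rank-to-GPU mapping are arbitrary assumptions, and it presumes a CUDA-aware MPI build so that device pointers can be passed straight to MPI_Bcast.

/* Minimal sketch (assumptions noted in comments): broadcast a large GPU buffer
 * with (1) CUDA-aware MPI and (2) NCCL2. Example build line, assuming MPI,
 * CUDA, and NCCL are installed:
 *   mpicc bcast_sketch.c -lcudart -lnccl -o bcast_sketch
 */
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

#define COUNT (64 * 1024 * 1024)   /* 64M floats (256 MB): a "large message" */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simplified rank-to-GPU mapping; assumes up to 8 GPUs per node. */
    cudaSetDevice(rank % 8);

    float *d_buf;
    cudaMalloc((void **)&d_buf, COUNT * sizeof(float));

    /* (1) CUDA-aware MPI: the device pointer is passed directly to MPI_Bcast;
     *     the MPI runtime handles staging/pipelining of the GPU buffer. */
    MPI_Bcast(d_buf, COUNT, MPI_FLOAT, /*root=*/0, MPI_COMM_WORLD);

    /* (2) NCCL2: bootstrap a NCCL communicator over the same ranks
     *     (the unique id is distributed with a small host-side MPI_Bcast),
     *     then broadcast the GPU buffer on a CUDA stream. */
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, size, id, rank);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    ncclBroadcast(d_buf, d_buf, COUNT, ncclFloat, /*root=*/0, comm, stream);
    cudaStreamSynchronize(stream);

    ncclCommDestroy(comm);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

In data-parallel DL training, broadcasts like these are typically used to distribute model parameters to all GPUs, which is why large-message broadcast latency directly affects training time.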
