IEEE Conference on Computer Communications

Preemptive All-reduce Scheduling for Expediting Distributed DNN Training

Abstract

Data-parallel training is widely used for scaling DNN training over large datasets, using the parameter server or all-reduce architecture. Communication scheduling, which overlaps communication with computation by reordering communication operations, is a promising approach to accelerating distributed DNN training. We identify two limitations of prior communication scheduling work. First, a layer-wise computation graph has been a common assumption, while modern machine learning frameworks (e.g., TensorFlow) use a sophisticated directed acyclic graph (DAG) as the execution model. Second, the default tensor sizes are often suboptimal for transmission scheduling and bandwidth utilization. We propose PACE, a communication scheduler that preemptively schedules (potentially fused) all-reduce tensors based on the DAG of DNN training, guaranteeing maximal overlap of communication with computation and high bandwidth utilization. The scheduler contains two integrated modules: given a DAG, we identify the best tensor-preemptive communication schedule that minimizes training time; then, using this optimal schedule as an oracle, a dynamic programming approach generates a good DAG by merging small communication tensors for efficient bandwidth utilization. Experiments on a GPU testbed show that PACE accelerates training under representative system configurations, achieving up to 36% speed-up over state-of-the-art solutions.
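To make the preemptive scheduling idea concrete, here is a minimal sketch, assuming gradient tensors are split into fixed-size chunks and a priority queue orders pending chunks by how early the corresponding layer is needed in the next iteration's forward pass; a newly produced high-priority tensor can thus overtake queued chunks of lower-priority ones. All names here (`PreemptiveScheduler`, `Chunk`, `CHUNK_SIZE`, `do_allreduce`) are illustrative, not from the paper.

```python
import heapq
from dataclasses import dataclass, field

CHUNK_SIZE = 4 * 1024 * 1024  # bytes per chunk; an assumed tunable

@dataclass(order=True)
class Chunk:
    priority: int                       # lower value = needed earlier next iteration
    tensor_id: str = field(compare=False)
    offset: int = field(compare=False)  # byte offset within the tensor
    size: int = field(compare=False)    # bytes in this chunk

class PreemptiveScheduler:
    def __init__(self, do_allreduce):
        self._queue = []                # min-heap ordered by priority
        self._do_allreduce = do_allreduce

    def submit(self, tensor_id, size, priority):
        """Split a ready gradient tensor into chunks and enqueue them."""
        for offset in range(0, size, CHUNK_SIZE):
            chunk = Chunk(priority, tensor_id, offset,
                          min(CHUNK_SIZE, size - offset))
            heapq.heappush(self._queue, chunk)

    def run_one(self):
        """Launch the highest-priority pending chunk, if any.

        A real scheduler would launch asynchronously; this sketch calls
        the all-reduce primitive synchronously for clarity.
        """
        if self._queue:
            chunk = heapq.heappop(self._queue)
            self._do_allreduce(chunk.tensor_id, chunk.offset, chunk.size)
            return True
        return False
```

Chunking bounds the delay before a newly ready, higher-priority tensor gets the link to at most one chunk's transfer time, which is one reason the default (often large) tensor sizes are suboptimal for transmission scheduling.

The second module, tensor fusion via dynamic programming, can be sketched similarly. Under an assumed linear cost model (a fixed startup latency plus size divided by bandwidth, which is not necessarily the paper's cost model), the toy DP below picks fusion boundaries over tensors in their scheduled order so as to minimize the finish time of the last all-reduce; `ALPHA`, `BANDWIDTH`, and `fuse` are hypothetical names.

```python
ALPHA = 1e-3        # assumed per-all-reduce startup latency, seconds
BANDWIDTH = 1e9     # assumed link bandwidth, bytes/second

def fuse(ready, sizes):
    """Pick fusion boundaries over tensors in scheduled order.

    ready[i] is the time tensor i is produced; sizes[i] is its byte size.
    Returns (estimated finish time, list of (start, end) groups).
    """
    n = len(ready)
    dp = [0.0] + [float("inf")] * n   # dp[i]: best finish time for tensors < i
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            # Fuse tensors j..i-1 into one all-reduce: it starts once the
            # previous group is done and the group's last tensor is ready.
            start = max(dp[j], max(ready[j:i]))
            finish = start + ALPHA + sum(sizes[j:i]) / BANDWIDTH
            if finish < dp[i]:
                dp[i], cut[i] = finish, j
    groups, i = [], n                 # recover the chosen group boundaries
    while i > 0:
        groups.append((cut[i], i))
        i = cut[i]
    return dp[n], groups[::-1]
```

For example, `fuse([0.0, 0.01, 0.02], [1 << 20, 1 << 18, 1 << 22])` returns an estimated finish time and the chosen groups. Fusing small adjacent tensors amortizes the per-message latency, at the cost of delaying a group's start until its last tensor is ready, which is exactly the trade-off the DP weighs.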
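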
