Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Accelerating Deep Learning using Multiple GPUs and FPGA-Based 10GbE Switch



Abstract

A back-propagation algorithm following a gradient descent approach is used for training deep neural networks. Since it iteratively performs a large number of matrix operations to compute the gradients, GPUs (Graphics Processing Units) are especially efficient for the training phase. Thus, a cluster of computers, each of which is equipped with multiple GPUs, can significantly accelerate the training phase. Although the gradient computation is still the major bottleneck of training, gradient aggregation and parameter optimization impose both communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, in this paper, multiple GPUs are interconnected with PCI Express (PCIe) over 10 Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected via network switches, gradient aggregation and optimizers (e.g., SGD, Adagrad, Adam, and SMORMS3) are offloaded to an FPGA-based network switch between a host machine and the remote GPUs; thus, the gradient aggregation and optimization are completed in the network. Evaluation results using four remote GPUs connected via the FPGA-based 10GbE switch that implements the four optimizers demonstrate that these optimization algorithms are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput of the FPGA-based switch achieves 98.3% of the 10GbE line rate.
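To illustrate what is being offloaded to the switch, the sketch below shows the two in-network steps the abstract describes: averaging the gradients arriving from the remote GPUs, then applying an optimizer step to the shared parameters. This is a minimal NumPy model of the dataflow only, not the paper's FPGA implementation; all function names and hyperparameter defaults here are illustrative assumptions.

```python
import numpy as np

def aggregate_gradients(worker_grads):
    # In-network aggregation: average the gradient tensors
    # received from the remote GPUs (done by the switch in the paper).
    return np.mean(worker_grads, axis=0)

def sgd_update(params, grad, lr=0.01):
    # Plain SGD step on the aggregated gradient.
    return params - lr * grad

def adam_update(params, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam step; returns updated params and moment estimates.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Performing these two steps inside the switch, as the paper proposes, removes a host round-trip per iteration: the aggregated gradient never has to travel to the host CPU or a GPU before the parameter update is applied.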
