Euromicro International Conference on Parallel, Distributed and Network-Based Processing

Accelerating Deep Learning using Multiple GPUs and FPGA-Based 10GbE Switch



Abstract

A back-propagation algorithm following a gradient descent approach is used for training deep neural networks. Since it iteratively performs a large number of matrix operations to compute the gradients, GPUs (Graphics Processing Units) are especially efficient for the training phase. Thus, a cluster of computers, each of which is equipped with multiple GPUs, can significantly accelerate the training phase. Although the gradient computation is still the major bottleneck of training, gradient aggregation and parameter optimization impose both communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, in this paper, multiple GPUs are interconnected with PCI Express (PCIe) over 10 Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected via network switches, gradient aggregation and optimizers (e.g., SGD, Adagrad, Adam, and SMORMS3) are offloaded to an FPGA-based network switch between a host machine and the remote GPUs; thus, the gradient aggregation and optimization are completed in the network. Evaluation results using four remote GPUs connected via the FPGA-based 10GbE switch that implements the four optimizers demonstrate that these optimization algorithms are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput of the FPGA-based switch achieves 98.3% of the 10GbE line rate.
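To illustrate what is being offloaded to the switch, the sketch below shows the two in-network steps the abstract describes: averaging the gradients arriving from the remote GPUs, then applying an optimizer step to the shared parameters. This is a minimal NumPy model of the dataflow only, not the paper's FPGA implementation; all function names and hyperparameter defaults here are illustrative assumptions.

```python
import numpy as np

def aggregate_gradients(worker_grads):
    # In-network aggregation: average the gradient tensors
    # received from the remote GPUs (done by the switch in the paper).
    return np.mean(worker_grads, axis=0)

def sgd_update(params, grad, lr=0.01):
    # Plain SGD step on the aggregated gradient.
    return params - lr * grad

def adam_update(params, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam step; returns updated params and moment estimates.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Performing these two steps inside the switch, as the paper proposes, removes a host round-trip per iteration: the aggregated gradient never has to travel to the host CPU or a GPU before the parameter update is applied.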
