International Symposium on Microarchitecture

A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks

Abstract

Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes inordinate time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-the-art distributed training algorithms use a hierarchy of worker-aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of workers and send back the updated weights. This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs). To maximize the benefits of in-network acceleration, the proposed solution, named INCEPTIONN (In-Network Computing to Exchange and Process Training Information Of Neural Networks), uniquely combines hardware and algorithmic innovations by exploiting the following three observations. (1) Gradients are significantly more tolerant to precision loss than weights and as such lend themselves better to aggressive compression without the need for complex mechanisms to avert any loss. (2) The existing training algorithms only communicate gradients in one leg of the communication, which reduces the opportunities for in-network acceleration of compression. (3) The aggregators can become a bottleneck with compression as they need to compress/decompress multiple streams from their allocated worker group. To this end, we first propose a lightweight and hardware-friendly lossy-compression algorithm for floating-point gradients, which exploits their unique value characteristics. This compression not only enables significantly reducing the gradient communication with practically no loss of accuracy, but also comes with low complexity for direct implementation as a hardware block in the NIC. To maximize the opportunities for compression and avoid the bottleneck at aggregators, we also propose an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner. Without changing the mathematics of training, this algorithm leverages the associative property of the aggregation operator and enables our in-network accelerators to (1) apply compression for all communications, and (2) prevent the aggregator nodes from becoming bottlenecks. Our experiments demonstrate that INCEPTIONN reduces the communication time by 70.9~80.7% and offers 2.2~3.1x speedup over the conventional training system, while achieving the same level of accuracy.
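The abstract does not spell out the compression algorithm itself, so the snippet below is only a minimal stand-in for the idea of hardware-friendly lossy compression of float32 gradients: it truncates each value to its top 16 bits (sign, exponent, shortened mantissa), halving the bytes on the wire, whereas INCEPTIONN's actual encoding further exploits the narrow value range of gradients. The function names and NumPy implementation are illustrative assumptions, not the paper's code.

```python
import numpy as np

def compress_gradients(grad: np.ndarray) -> np.ndarray:
    """Lossy sketch: keep only the top 16 bits of each float32 gradient
    (sign, 8 exponent bits, 7 mantissa bits), halving wire traffic."""
    bits = np.ascontiguousarray(grad, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)

def decompress_gradients(payload: np.ndarray) -> np.ndarray:
    """Re-expand the 16-bit payload by zero-filling the dropped mantissa bits."""
    return (payload.astype(np.uint32) << 16).view(np.float32)
```

Because gradients tolerate precision loss far better than weights (observation 1), this style of truncation leaves accuracy essentially intact while remaining simple enough to implement as a fixed-function block in the NIC.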
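The aggregator-free algorithm is likewise described only at a high level, so the following single-process simulation is a generic ring-style reduce-scatter/all-gather sketch of collective aggregation, assuming a hypothetical `local_grads` list holding one gradient vector per worker; it is not the paper's exact protocol. It shows how the associativity of the sum lets the workers aggregate collectively, with gradient blocks flowing in both legs of every exchange, where the NIC compressor above could be applied.

```python
import numpy as np

def aggregator_free_sum(local_grads):
    """Ring-style reduce-scatter + all-gather, simulated in one process.
    Each 'send/receive' below would be a NIC transfer in a real cluster."""
    n = len(local_grads)
    # Work on copies so the callers' gradient arrays are not modified in place.
    chunks = [np.array_split(np.array(g, dtype=np.float32), n) for g in local_grads]

    # Phase 1 (reduce-scatter): blocks circulate around the ring and are summed;
    # associativity of '+' means the order of partial sums does not matter.
    for step in range(n - 1):
        for w in range(n):
            src, blk = (w - 1) % n, (w - step - 1) % n
            chunks[w][blk] += chunks[src][blk]

    # Phase 2 (all-gather): the fully summed blocks circulate until every
    # worker holds the complete aggregated gradient.
    for step in range(n - 1):
        for w in range(n):
            src, blk = (w - 1) % n, (w - step) % n
            chunks[w][blk] = chunks[src][blk].copy()

    return [np.concatenate(c) for c in chunks]  # identical sum on every worker
```

Because the exchange is symmetric, gradients travel in both legs (observation 2), every transfer can pass through the in-NIC compressor, and no single aggregator node has to terminate multiple streams (observation 3). Dividing the result by n yields the averaged gradient if the optimizer expects a mean rather than a sum.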
