IEEE Conference on Computer Vision and Pattern Recognition

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters



Abstract

Long training times for high-accuracy deep neural networks (DNNs) impede research into new DNN architectures and slow the development of high-accuracy DNNs. In this paper we present FireCaffe, which successfully scales deep neural network training across a cluster of GPUs. We also present a number of best practices to aid in comparing advancements in methods for scaling and accelerating the training of deep neural networks. The speed and scalability of distributed algorithms are almost always limited by the overhead of communicating between servers; DNN training is not an exception to this rule. Therefore, the key consideration here is to reduce communication overhead wherever possible, while not degrading the accuracy of the DNN models that we train. Our approach has three key pillars. First, we select network hardware that achieves high bandwidth between GPU servers; Infiniband or Cray interconnects are ideal for this. Second, we consider a number of communication algorithms, and we find that reduction trees are more efficient and scalable than the traditional parameter server approach. Third, we optionally increase the batch size to reduce the total quantity of communication during DNN training, and we identify hyperparameters that allow us to reproduce the small-batch accuracy while training with large batch sizes. When training GoogLeNet and Network-in-Network on ImageNet with a cluster of 128 GPUs, we achieve 47x and 39x speedups, respectively.
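The second and third pillars lend themselves to a concrete illustration. Below is a minimal Python sketch, not the FireCaffe implementation itself; the worker count, gradient size, and function names are hypothetical, chosen only to make the communication pattern visible. It shows why summing gradients up a binary reduction tree takes on the order of log2(W) communication rounds for W workers, while a single parameter server must pull all W gradient messages through one link.

import numpy as np

def tree_allreduce(local_grads):
    # Sum gradients pairwise up a binary reduction tree.
    # With W workers the reduce phase takes ceil(log2(W)) rounds; a broadcast
    # back down the tree (not simulated here) costs roughly the same again.
    grads = [g.copy() for g in local_grads]
    w = len(grads)
    rounds = 0
    stride = 1
    while stride < w:
        # In each round, worker i absorbs the partial sum held by worker i + stride.
        for i in range(0, w, 2 * stride):
            if i + stride < w:
                grads[i] += grads[i + stride]
        stride *= 2
        rounds += 1
    total = grads[0]                        # worker 0 now holds the full sum
    return [total.copy() for _ in range(w)], rounds

if __name__ == "__main__":
    workers = 128                           # cluster size reported in the paper
    grad_size = 100_000                     # hypothetical, much smaller than a real DNN
    rng = np.random.default_rng(0)
    local = [rng.standard_normal(grad_size).astype(np.float32) for _ in range(workers)]

    summed, reduce_rounds = tree_allreduce(local)
    assert np.allclose(summed[0], np.sum(local, axis=0), atol=1e-2)

    print("reduction-tree rounds:", reduce_rounds)               # 7 for 128 workers
    print("serial transfers into one parameter server:", workers)

On the batch-size side, each data-parallel worker exchanges one full gradient per iteration, so communication per epoch scales roughly as (training examples / batch size) × gradient size; doubling the batch size therefore roughly halves the number of gradient exchanges per epoch, which is why FireCaffe optionally trains with larger batches once hyperparameters that preserve small-batch accuracy are identified.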
