IEEE Transactions on Parallel and Distributed Systems

Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs


Abstract

Since its creation, the ImageNet-1k benchmark set has played a significant role in ascertaining the accuracy of different deep neural network (DNN) models on the image classification problem. Moreover, in recent years it has also served as the principal benchmark for assessing different approaches to DNN training. Finishing a 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days. This training requires 10^18 single-precision operations in total. On the other hand, the world's current fastest supercomputer can finish 3 x 10^17 single-precision operations per second (according to the Nov 2018 Top 500 results). If we can make full use of the computing capability of the fastest supercomputer, we should be able to finish the training in several seconds. Over the last two years, researchers have focused on closing this significant performance gap by scaling DNN training to larger numbers of processors. Most successful approaches to scaling ImageNet training have used synchronous mini-batch stochastic gradient descent (SGD). However, to scale synchronous SGD one must also increase the batch size used in each iteration. Thus, for many researchers, the focus on scaling DNN training has translated into a focus on developing training algorithms that enable increasing the batch size in data-parallel synchronous SGD without losing accuracy over a fixed number of epochs. In this paper, we investigate supercomputers' capability of speeding up DNN training. Our approach is to use a large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient use of massive computing resources. Our approach is generic: we empirically evaluate its effectiveness on five neural networks (AlexNet, AlexNet-BN, GNMT, ResNet-50, and ResNet-50-v2) trained with large datasets while preserving state-of-the-art test accuracy. Compared to the baseline of a previous study from Goyal et al. [1], our approach shows higher test accuracy on batch sizes that are larger than 16K. When we use the same baseline, our results are better than those of Goyal et al. for all batch sizes (Fig. 20). Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe, Facebook's PyTorch, and Google's TensorFlow. The differences between this paper and the conference version of our work [2] include: (1) we implement our approach on Google's cloud Tensor Processing Unit (TPU) platform, which verifies our previous success on CPUs and GPUs; (2) we scale the batch size of ResNet-50-v2 to 32K and achieve 76.3 percent accuracy, better than the 75.3 percent accuracy achieved in our conference paper; and (3) we apply our approach to Google's Neural Machine Translation (GNMT) application, which helps us achieve a 4x speedup on cloud TPUs.
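The abstract credits the Layer-wise Adaptive Rate Scaling (LARS) algorithm with enabling large-batch training without loss of accuracy. As a rough illustration only, the sketch below shows the general shape of a LARS-style update in plain NumPy: each layer gets a local learning rate proportional to the ratio of its weight norm to its gradient norm. The function name lars_update and the default hyperparameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def lars_update(weights, grads, global_lr, trust_coef=0.001,
                weight_decay=5e-4, momentum=0.9, velocities=None):
    """One illustrative LARS-style step over per-layer parameter arrays.

    Minimal sketch, assuming the commonly described LARS rule: the layer-local
    learning rate is trust_coef * ||w|| / (||g|| + weight_decay * ||w||), so
    layers whose gradients are small relative to their weights still make
    progress, and layers with large gradients do not diverge at large batch
    sizes. Exact hyperparameters and schedule details are defined in the paper.
    """
    if velocities is None:
        velocities = [np.zeros_like(w) for w in weights]

    new_weights, new_velocities = [], []
    for w, g, v in zip(weights, grads, velocities):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Layer-wise trust ratio; fall back to 1.0 when norms are degenerate.
        if w_norm > 0.0 and g_norm > 0.0:
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
        else:
            local_lr = 1.0
        # SGD with momentum, scaled by the global and layer-local learning rates.
        v = momentum * v + global_lr * local_lr * (g + weight_decay * w)
        new_weights.append(w - v)
        new_velocities.append(v)
    return new_weights, new_velocities
```

In a data-parallel synchronous SGD setting such as the one the abstract describes, the per-layer gradients passed to such an update would first be averaged across all workers (for example via an all-reduce), and the global learning rate would typically follow a warmup and scaling schedule appropriate to the enlarged batch size.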
