TurboDL: Improving the CNN Training on GPU With Fine-Grained Multi-Streaming Scheduling
IEEE Transactions on Computers
Abstract

Graphics Processing Units (GPUs) have evolved into powerful co-processors for CNN training, and many new features, such as concurrent kernel execution and Hyper-Q technology, have been introduced. Orchestrating concurrency for convolutional neural network (CNN) training on GPUs is challenging, however, since naive concurrency can introduce synchronization overhead and poor resource utilization. Unlike previous research, which mainly focuses on single-layer or coarse-grained optimization, we introduce a critical-path based, asynchronous parallelization mechanism and propose an optimization technique for CNN training that jointly accounts for the global network architecture and GPU resource usage. The proposed methods effectively overlap synchronization and computation across different streams, accelerating the CNN training process. We have integrated our methods into Caffe. The experimental results show that Caffe integrated with our methods achieves a 1.30X performance speedup on average compared with Caffe+cuDNN, and even higher speedups are achieved for deeper, wider, and more complicated networks.
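The core idea of critical-path based multi-stream scheduling can be illustrated with a toy sketch (this is an illustrative reconstruction, not the paper's actual algorithm or code): given a layer DAG with estimated per-layer costs, the longest-cost chain (the critical path) is kept on one stream, while independent branch layers are assigned to another stream so their computation can overlap.

```python
# Toy sketch of critical-path based stream assignment for a layer DAG.
# Assumption: layer names, costs, and the two-stream policy below are
# hypothetical examples, not taken from TurboDL itself.

def critical_path(costs, deps):
    """Longest-path (by cost) finish time for each layer in a DAG.
    costs: {layer: time}; deps: {layer: [predecessor layers]}."""
    finish = {}

    def ft(layer):
        if layer not in finish:
            start = max((ft(p) for p in deps.get(layer, [])), default=0.0)
            finish[layer] = start + costs[layer]
        return finish[layer]

    for layer in costs:
        ft(layer)
    return finish

def assign_streams(costs, deps):
    """Stream 0 gets the critical path; off-path layers go to stream 1
    so their kernels can overlap with critical-path work."""
    finish = critical_path(costs, deps)
    # Walk back from the layer that finishes last, always following the
    # predecessor with the largest finish time: that chain is the critical path.
    path = [max(finish, key=finish.get)]
    while deps.get(path[-1]):
        path.append(max(deps[path[-1]], key=finish.get))
    on_path = set(path)
    return {layer: 0 if layer in on_path else 1 for layer in costs}

# Example: an inception-style branch, conv1 -> {conv2a, conv2b} -> concat.
costs = {"conv1": 4.0, "conv2a": 6.0, "conv2b": 2.0, "concat": 1.0}
deps = {"conv2a": ["conv1"], "conv2b": ["conv1"], "concat": ["conv2a", "conv2b"]}
streams = assign_streams(costs, deps)
# The cheap branch conv2b lands on stream 1 and overlaps with conv2a.
```

In a real implementation the "streams" would be CUDA streams and the off-path kernels would be launched asynchronously, with events synchronizing only at true DAG join points (e.g., the concat layer) rather than after every layer.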
