
Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures



Abstract

Convolutional neural networks (CNNs) have proven to be powerful classification tools in tasks ranging from check reading to medical diagnosis, coming close to human perception and in some cases surpassing it. However, the problems to be solved keep growing larger and more complex, which translates into larger CNNs and training times so long that even the adoption of Graphics Processing Units (GPUs) cannot keep up with them. This problem is partially addressed by using more processing units and the distributed training methods offered by several frameworks dedicated to neural network training, such as Caffe, Torch, or TensorFlow. However, these techniques do not take full advantage of the parallelization opportunities that CNNs offer, nor of the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory sizes, and other characteristics. This paper presents a new method for the parallel training of CNNs in which only the convolutional layer is distributed. The paper analyzes the influence of network size, bandwidth, batch size, the number of devices and their processing capabilities, and other parameters. Results show that this technique can reduce training time without affecting classification performance, for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers of 500 and 1500 kernels, respectively, the best speedups reach 3.28x with four CPUs and 2.45x with three GPUs. Larger datasets will spend more than 60-90% of their processing time computing convolutions, and speedups will tend to increase accordingly.
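To make the core idea concrete, below is a minimal sketch (not the authors' implementation; the device speeds, helper names, and shapes are hypothetical) of distributing only the convolutional layer: each mini-batch is split across heterogeneous devices in proportion to their relative throughput, and the resulting feature maps are gathered before the next layer. An Amdahl-style bound also puts the reported numbers in context: if a fraction p of training time goes to convolutions distributed over n devices, the speedup is roughly 1 / ((1 - p) + p / n); for instance, p ≈ 0.93 and n = 4 give about 3.3x, consistent with the reported 3.28x.

import numpy as np

def conv2d_valid(image, kernel):
    # Naive single-channel "valid" convolution as used in CNNs;
    # a stand-in for whatever conv op each device actually runs.
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def split_batch(batch_size, speeds):
    # Partition a mini-batch proportionally to each device's relative speed,
    # so faster devices get more images and all finish at roughly the same time.
    shares = np.floor(batch_size * np.asarray(speeds) / np.sum(speeds)).astype(int)
    shares[-1] += batch_size - shares.sum()  # hand the rounding remainder to the last device
    return shares

speeds = [4.0, 1.0, 1.0]                # hypothetical pool: one GPU, two slower CPUs
batch = np.random.rand(32, 28, 28)      # 32 single-channel 28x28 images
kernel = np.random.rand(5, 5)

shards = np.split(batch, np.cumsum(split_batch(len(batch), speeds))[:-1])
# In a real system each shard runs on its own device in parallel;
# a sequential loop stands in for that here to show the data flow.
outputs = [np.stack([conv2d_valid(img, kernel) for img in shard]) for shard in shards]
feature_maps = np.concatenate(outputs)  # gather for the (non-distributed) next layer
print(feature_maps.shape)               # -> (32, 24, 24)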

Bibliographic details

  • Source
    Applied Artificial Intelligence | 2018, Issue 10 | pp. 822-844 | 23 pages
  • Author affiliations

    Univ Coimbra, Dept Elect & Comp Engn, Inst Telecomunicacoes, Coimbra, Portugal;

    Univ Beira Interior, Dept Informat, Inst Telecomunicacoes, Covilha, Portugal;

    Ecole Polytech Fed Lausanne, Sch Comp & Commun Sci, Lausanne, Switzerland;

  • Indexing information
  • Original format: PDF
  • Language of text: eng
  • CLC number
  • Keywords
