IEEE International Parallel and Distributed Processing Symposium

ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines


Abstract

Convolutional networks (ConvNets) have become a popular approach to computer vision. It is important to accelerate ConvNet training, which is computationally costly. We propose a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs. Applying Brent's theorem to the task dependency graph implies that linear speedup with the number of processors is attainable within the PRAM model of parallel computation, for wide network architectures. To attain such performance on real shared-memory machines, our algorithm computes convolutions converging on the same node of the network with temporal locality to reduce cache misses, and sums the convergent convolution outputs via an almost wait-free concurrent method to reduce time spent in critical sections. We implement the algorithm with a publicly available software package called ZNN. Benchmarking with multi-core CPUs shows that ZNN can attain speedup roughly equal to the number of physical cores. We also show that ZNN can attain over 90× speedup on a many-core CPU (Xeon Phi Knights Corner). These speedups are achieved for network architectures with widths that are in common use. The task parallelism of the ZNN algorithm is suited to CPUs, while the SIMD parallelism of previous algorithms is compatible with GPUs. Through examples, we show that ZNN can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and density and size of the output patch. ZNN may be less costly to develop and maintain, due to the relative ease of general-purpose CPU programming.
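A brief note on the speedup claim (notation ours, not taken from the paper): for a task dependency graph with total work W and critical-path depth D, Brent's theorem bounds the running time on p processors by

    T_p \le \frac{W}{p} + D, \qquad S_p = \frac{T_1}{T_p} \ge \frac{p}{1 + pD/W}.

The speedup S_p approaches p whenever W/D \gg p. Wide ConvNet layers generate many independent convolution/FFT tasks per level, so W/D is large and near-linear speedup is attainable in the PRAM model, as the abstract states.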
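The "almost wait-free" summation of convergent convolution outputs can be pictured with a small sketch. The C++ fragment below is our illustration under assumed names (NodeAccumulator, contribute, and the buffer layout are hypothetical, not the ZNN source): contributing tasks park partial sums through atomic pointer exchange and merge each other's parked buffers instead of queuing on a lock, and the last contributor hands the completed node sum to the downstream task.

// sketch_accumulate.cpp -- illustrative only; not the ZNN implementation.
// Accumulates per-edge convolution outputs into a single node sum using
// atomic pointer exchange, so contributors rarely block on one another.
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Buffer = std::vector<float>;

struct NodeAccumulator {
    std::atomic<Buffer*> pending{nullptr};  // one parked partial sum, or null
    std::atomic<int>     remaining;         // contributions still expected
    explicit NodeAccumulator(int n) : remaining(n) {}

    // Called by each task once its convolution output `mine` is ready.
    // Returns the final sum when this call is the last contributor,
    // otherwise nullptr. Buffer ownership passes through `pending`.
    Buffer* contribute(Buffer* mine) {
        for (;;) {
            // Grab any partial sum parked by another task and fold it in.
            if (Buffer* other = pending.exchange(nullptr)) {
                for (std::size_t i = 0; i < mine->size(); ++i)
                    (*mine)[i] += (*other)[i];
                delete other;
            }
            // Try to park our (possibly merged) partial sum.
            Buffer* expected = nullptr;
            if (pending.compare_exchange_strong(expected, mine)) break;
            // Another task parked a buffer meanwhile; loop and merge it too.
        }
        if (remaining.fetch_sub(1) == 1)        // last contributor
            return pending.exchange(nullptr);   // holds the complete sum
        return nullptr;
    }
};

int main() {
    constexpr int kInputs = 8;              // fan-in of the node (hypothetical)
    constexpr std::size_t kVoxels = 1 << 12;
    NodeAccumulator acc(kInputs);

    std::vector<std::thread> tasks;
    std::atomic<Buffer*> total{nullptr};
    for (int t = 0; t < kInputs; ++t) {
        tasks.emplace_back([&, t] {
            // Stand-in for an FFT- or direct-convolution task output.
            auto* out = new Buffer(kVoxels, float(t + 1));
            if (Buffer* sum = acc.contribute(out)) total.store(sum);
        });
    }
    for (auto& th : tasks) th.join();

    Buffer* sum = total.load();
    assert(sum && (*sum)[0] == 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8);
    delete sum;
    return 0;
}

In this pattern the only shared state is a single atomic pointer and a counter per node, so contention causes at most a short merge-and-retry rather than serialization in a critical section, which matches the goal stated in the abstract of reducing time spent in critical sections.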
