International Symposium on Microarchitecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Abstract

Accelerating neural network training is critical for exploring the design space of neural networks. Data parallelism is commonly used to accelerate training of Convolutional Neural Networks (CNNs), where the input batch is distributed across multiple workers; however, the increased communication of weight gradients across the workers limits scalability. In this work, we propose multi-dimensional parallel (MDP) training of the convolution layer by exploiting both the data parallelism and the intra-tile parallelism available in Winograd-transformed convolution. Workers are organized across two dimensions: one dimension exploits intra-tile parallelism while the other exploits data parallelism. MDP reduces the communication required for weight gradients since weight gradients are communicated only across the data-parallelism dimension. However, the Winograd transform fundamentally requires more data accesses, and the proposed MDP architecture also introduces a new type of communication that we refer to as tile transfer: the gather/scatter of Winograd-domain feature maps (tiles). We propose a scalable near-data processing (NDP) architecture that minimizes the cost of data accesses through 3D stacked memory while leveraging a memory-centric network organization to provide high connectivity among the workers with intra-tile parallelism, accelerating tile transfer. To minimize the communication overhead of tile gathering, we predict the activation of spatial-domain neurons and remove the communication of tiles that transform to non-activated neurons. To balance the communication required for weight gradients and tile transfer, we also propose a reconfigurable memory-centric network architecture that reconfigures network channel connectivity between the workers for each convolution layer. Our evaluations show that the proposed MDP with the NDP architecture accelerates training by 2.7× compared to data-parallel training on the same NDP architecture, and by 9.5-21× compared to a multi-GPU system.
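To make the intra-tile parallelism concrete, the following is a minimal NumPy sketch of a single F(2×2, 3×3) Winograd-transformed convolution tile, using the standard transform matrices from Lavin and Gray's formulation. This is an illustrative sketch, not the paper's implementation: once the filter and input tile are moved into the Winograd domain, the convolution collapses into 16 mutually independent element-wise products per tile, and it is these independent products that the intra-tile dimension of MDP can distribute across workers.

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T        # filter transformed into the Winograd domain (4x4)
    V = B_T @ d @ B_T.T    # input tile transformed into the Winograd domain (4x4)
    M = U * V              # 16 independent element-wise products: each Winograd-domain
                           # coordinate can be assigned to a different worker, which is
                           # the intra-tile parallelism MDP exploits
    return A_T @ M @ A_T.T # inverse transform back to the spatial domain (2x2)

# Sanity check against direct sliding-window correlation (the CNN convention).
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

Under this reading of the abstract, a worker along the intra-tile dimension would own a fixed subset of the 16 Winograd-domain coordinates across all tiles, so the element-wise products themselves need no inter-worker communication; it is the forward and inverse transforms that require the tile gather/scatter (tile transfer) which the proposed memory-centric network accelerates.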