International Symposium on Microarchitecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Abstract

Accelerating neural network training is critical for exploring the design space of neural networks. Data parallelism is commonly used to accelerate training of Convolutional Neural Networks (CNNs), where the input batch is distributed across multiple workers; however, the increased communication of weight gradients across the workers limits scalability. In this work, we propose multi-dimensional parallel (MDP) training of the convolution layer by exploiting both the data parallelism and the intra-tile parallelism available in Winograd-transformed convolution. Workers are organized across two dimensions: one dimension exploits intra-tile parallelism while the other exploits data parallelism. MDP reduces the communication required for weight gradients since weight gradients are communicated only across the data-parallelism dimension. However, the Winograd transform fundamentally requires more data accesses, and the proposed MDP architecture also introduces a new type of communication that we refer to as tile transfer: the gather/scatter of Winograd-domain feature maps (tiles). We propose a scalable near-data processing (NDP) architecture that minimizes the cost of data accesses through 3D stacked memory while leveraging a memory-centric network organization to provide high connectivity among the workers with intra-tile parallelism, accelerating tile transfer. To minimize the communication overhead of tile gathering, we predict the activation of spatial-domain neurons and remove the communication of tiles that transform to non-activated neurons. To balance the communication required for weight gradients and tile transfer, we also propose a reconfigurable memory-centric network architecture that reconfigures network channel connectivity between the workers for each convolution layer. Our evaluations show that the proposed MDP with the NDP architecture accelerates training by 2.7× compared to data-parallel training on the same NDP architecture, and by 9.5-21× compared to a multi-GPU system.
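To make the intra-tile parallelism concrete, the following is a minimal NumPy sketch of a single F(2×2, 3×3) Winograd-transformed convolution tile, using the standard transform matrices from Lavin and Gray's formulation. This is an illustrative sketch, not the paper's implementation: once the filter and input tile are moved into the Winograd domain, the convolution collapses into 16 mutually independent element-wise products per tile, and it is these independent products that the intra-tile dimension of MDP can distribute across workers.

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """Compute a 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T        # filter transformed into the Winograd domain (4x4)
    V = B_T @ d @ B_T.T    # input tile transformed into the Winograd domain (4x4)
    M = U * V              # 16 independent element-wise products: each Winograd-domain
                           # coordinate can be assigned to a different worker, which is
                           # the intra-tile parallelism MDP exploits
    return A_T @ M @ A_T.T # inverse transform back to the spatial domain (2x2)

# Sanity check against direct sliding-window correlation (the CNN convention).
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

Under this reading of the abstract, a worker along the intra-tile dimension would own a fixed subset of the 16 Winograd-domain coordinates across all tiles, so the element-wise products themselves need no inter-worker communication; it is the forward and inverse transforms that require the tile gather/scatter (tile transfer) which the proposed memory-centric network accelerates.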