Journal: Progress in Artificial Intelligence

Exploring Efficient Acceleration Architecture for Winograd-Transformed Transposed Convolution of GANs on FPGAs

Abstract

An acceleration architecture for transposed convolution layers is essential because transposed convolution operations, critical components of the generative model in generative adversarial networks (GANs), are inherently computationally intensive. In addition, the pre-processing step of inserting and padding zeros into the input feature maps causes many ineffective operations. Most known FPGA (Field Programmable Gate Array) based architectures for convolution layers cannot tackle these issues. In this paper, we first propose a novel dataflow that splits the filters and their corresponding input feature maps into four sets and then applies the Winograd algorithm for fast, highly efficient processing. Second, we present an underlying FPGA-based accelerator architecture that features its own processing units with an embedded parallel, pipelined, and buffered processing flow. Finally, we explore a parallelism-aware memory partition technique and the hardware-based design space, which provide the required parallel operations and the optimal design parameters, respectively. Experiments on several state-of-the-art GANs show that our methods achieve an average performance of 639.2 GOPS on Xilinx ZCU102 and 162.5 GOPS on Xilinx VC706. Relative to a conventional optimized accelerator baseline, this work demonstrates an 8.6x (up to 11.7x) increase in processing performance, compared with improvements below 2.2x reported by prior studies in the literature.
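
The zero-insertion overhead and the four-set split mentioned in the abstract can be illustrated with a small sketch. It is not the paper's dataflow: it assumes a stride of 2 and a 4x4 filter (both chosen here for illustration), splits the filter by row and column parity into four sub-filters, and omits the Winograd step entirely. The function names and the multiply-accumulate (MAC) counters are hypothetical helpers introduced for this example.

```python
STRIDE = 2  # assumed stride; with stride 2 the parity split yields four sub-filters


def transposed_conv_zero_insert(x, w):
    """Reference: insert zeros, pad, then run an ordinary stride-1 convolution."""
    h, wd, k = len(x), len(x[0]), len(w)
    up_h, up_w = (h - 1) * STRIDE + 1, (wd - 1) * STRIDE + 1
    pad = k - 1
    padded = [[0] * (up_w + 2 * pad) for _ in range(up_h + 2 * pad)]
    for i in range(h):
        for j in range(wd):
            padded[pad + i * STRIDE][pad + j * STRIDE] = x[i][j]
    out_h, out_w = up_h + pad, up_w + pad
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0  # counts every multiply, including those against inserted zeros
    for u in range(out_h):
        for v in range(out_w):
            acc = 0
            for a in range(k):
                for b in range(k):
                    # The filter is flipped, as in a true convolution.
                    acc += padded[u + a][v + b] * w[k - 1 - a][k - 1 - b]
                    macs += 1
            out[u][v] = acc
    return out, macs


def transposed_conv_four_sets(x, w):
    """Same result from four parity sub-filters applied to the raw input."""
    h, wd, k = len(x), len(x[0]), len(w)
    out_h, out_w = (h - 1) * STRIDE + k, (wd - 1) * STRIDE + k
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0  # only multiplies against real input values are performed
    for a0 in range(STRIDE):
        for b0 in range(STRIDE):
            # Sub-filter holding the taps whose row/column parity is (a0, b0).
            sub = [row[b0::STRIDE] for row in w[a0::STRIDE]]
            for i in range(h):
                for j in range(wd):
                    for a, sub_row in enumerate(sub):
                        for b, coeff in enumerate(sub_row):
                            # Each sub-filter writes one interleaved output phase.
                            out[(i + a) * STRIDE + a0][(j + b) * STRIDE + b0] += x[i][j] * coeff
                            macs += 1
    return out, macs


# Small integer example so the two results can be compared exactly.
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
ref, ref_macs = transposed_conv_zero_insert(x, w)
dec, dec_macs = transposed_conv_four_sets(x, w)
print("outputs match:", ref == dec)
print("MACs with zero insertion:", ref_macs, "| MACs with four sub-filters:", dec_macs)
```

Because each sub-filter acts as a small stride-1 filter on the original, zero-free input, a fast algorithm such as Winograd could in principle be applied to each of the four sets; how the paper actually does this is not described in the abstract.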