Conference: International Conference on Application-specific Systems, Architectures and Processors

WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs

Abstract

Combining Winograd's algorithm with a systolic array architecture has been shown to improve DSP efficiency when accelerating convolutional neural networks (CNNs) on FPGA platforms. However, handling arbitrary convolution kernel sizes in FPGA-based Winograd processing elements and supporting efficient data access remain underexplored. In this work, we are the first to propose an optimized Winograd processing element (WinoPE) that naturally supports multiple convolution kernel sizes with the same amount of computing resources while maintaining high runtime DSP efficiency. Using the proposed WinoPE, we construct a highly efficient systolic array accelerator, termed WinoCNN. We also propose a dedicated memory subsystem to optimize data access. Based on the accelerator architecture, we build accurate resource and performance models to explore optimal accelerator configurations under different resource constraints. We implement the proposed accelerator on multiple FPGAs, where it outperforms state-of-the-art designs in both throughput and DSP efficiency. On the Xilinx ZCU102 FPGA, our implementation achieves DSP efficiency of up to 1.33 GOPS/DSP and throughput of up to 3.1 TOPS, which are 29.1% and 20.0% higher, respectively, than the best previously reported solutions.
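
For readers unfamiliar with the algorithm named in the abstract, the following is a minimal sketch of the textbook one-dimensional Winograd minimal filtering case F(2,3), which computes two outputs of a 3-tap filter with four multiplications instead of six. It is not the WinoPE design from the paper; the function names `direct_f23` and `winograd_f23` are hypothetical and chosen only for illustration.

```python
def direct_f23(d, g):
    """Direct 1D filtering: 2 outputs from 4 inputs and a 3-tap kernel.

    Uses 2 * 3 = 6 multiplications.
    """
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Textbook Winograd F(2,3): same 2 outputs with 4 multiplications.

    Illustrative only; hardware designs typically precompute the kernel
    transform once and reuse it across many input tiles.
    """
    # Kernel transform (depends only on g, so it can be precomputed).
    u = [
        g[0],
        (g[0] + g[1] + g[2]) / 2,
        (g[0] - g[1] + g[2]) / 2,
        g[2],
    ]
    # Input transform (additions and subtractions only).
    v = [
        d[0] - d[2],
        d[1] + d[2],
        d[2] - d[1],
        d[1] - d[3],
    ]
    # Element-wise products: the only 4 multiplications on the data path.
    m = [u[i] * v[i] for i in range(4)]
    # Output transform (additions and subtractions only).
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]   # input tile of 4 samples
    g = [1.0, 0.0, -1.0]       # 3-tap kernel
    print(direct_f23(d, g))
    print(winograd_f23(d, g))
```

Both functions should print the same pair of values for the same inputs. The reduction in multiplications is what makes Winograd-style processing attractive when DSP blocks are the scarce resource, which is the efficiency metric the abstract reports.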
