Mediterranean Conference on Embedded Computing

Bitwise Neural Network Acceleration: Opportunities and Challenges

Abstract

Real-time inference of deep convolutional neural networks (CNNs) on embedded systems and SoCs would enable many interesting applications. However, these CNNs are computation- and data-expensive, making it difficult to execute them in real time on energy-constrained embedded platforms. Recent research has shown that light-weight CNNs with quantized model weights and activations constrained to a single bit {-1, +1} can still achieve reasonable accuracy in comparison to the non-quantized 32-bit model. These binary neural networks (BNNs) theoretically allow a drastic reduction in the required energy and run-time by shrinking the memory size, reducing the number of memory accesses, and lowering the computational cost by replacing expensive two's complement arithmetic operations with more efficient bitwise versions. To exploit these advantages, we propose a bitwise CNN accelerator (BNNA) mapped onto an FPGA. We implement the Hubara'16 network [1] on the Xilinx Zynq-7020 SoC. Massive parallelism is achieved by performing 4608 binary MACs in parallel, which enables real-time speeds of up to 110 fps while using only 22% of the FPGA LUTs. In comparison to a 32-bit network, a 32x speed-up and a 40x resource reduction are achieved, with memory bandwidth as the main bottleneck. The detailed analysis of the carefully crafted accelerator design exposes the challenges and opportunities in bitwise neural network accelerator design.
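As a rough illustration of the bitwise arithmetic the abstract refers to (a sketch of ours, not code from the paper), a single binary MAC over bit-packed {-1, +1} values reduces to an XNOR followed by a population count. The C fragment below assumes 64 weight/activation pairs packed into one 64-bit word each, with bit values {0, 1} standing for {-1, +1}:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical sketch, not from the paper: the signed dot product of
     * 64 {-1,+1} pairs equals 2*(number of matching signs) - 64. XNOR
     * marks the matching bits; popcount counts them. */
    static int binary_mac64(uint64_t activations, uint64_t weights)
    {
        uint64_t agree = ~(activations ^ weights);  /* XNOR: 1 where signs match */
        int matches = __builtin_popcountll(agree);  /* GCC/Clang builtin popcount */
        return 2 * matches - 64;
    }

    int main(void)
    {
        uint64_t a = 0xF0F0F0F0F0F0F0F0ULL;
        uint64_t w = 0xFF00FF00FF00FF00ULL;
        /* 32 bit positions match and 32 differ, so the result is 0. */
        printf("dot product = %d\n", binary_mac64(a, w));
        return 0;
    }

On an FPGA the same idea presumably maps to LUT-level XNOR gates and adder trees rather than processor instructions, which is what makes thousands of such MACs per cycle feasible.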
