International Conference on Field Programmable Logic and Applications

A deep convolutional neural network based on nested residue number system

Abstract

Pre-trained deep convolutional neural networks (DCNNs), viewed as feed-forward computations, are widely used in embedded vision systems. In a DCNN, the 2D convolutional operation occupies more than 90% of the computation time. Because the 2D convolutional operation performs a massive number of multiply-accumulate (MAC) operations, conventional realizations could not implement a fully parallel DCNN. A residue number system (RNS) decomposes an integer into a tuple of L integers, namely its residues with respect to a set of moduli. Since the moduli must be pairwise coprime, the conventional RNS decomposes the MAC unit into circuits of different sizes. As a result, the RNS cannot make good use of FPGA resources, which come in uniform sizes. In this paper, we propose the nested RNS (NRNS), which recursively decomposes the RNS and can therefore decompose the MAC unit into circuits of small size. In the DCNN using the NRNS, a 48-bit MAC unit is decomposed into 4-bit units realized by the look-up tables of the FPGA. The system also uses binary-to-NRNS converters and NRNS-to-binary converters. The binary-to-NRNS converter is realized by on-chip BRAMs, while the NRNS-to-binary converter is realized by DSP blocks and BRAMs. This balanced usage of FPGA resources leads to a high clock frequency with less hardware. The ImageNet DCNN using the NRNS is implemented on a Xilinx Virtex VC707 evaluation board. In terms of performance per area, measured in GOPS (giga operations per second) per slice, the proposed design is 5.86 times better than the best existing realization.
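
The decomposition described in the abstract can be illustrated with a small software sketch. It is not the authors' hardware design: the moduli sets `(7, 11, 13)` and `(5, 6, 7)`, the sample data, and all function names are assumptions chosen only so the arithmetic is easy to follow. The sketch shows a MAC carried out independently in each residue channel, reconstruction by the Chinese remainder theorem, and one nested step in which a MAC modulo 11 is itself computed in a smaller inner RNS.

```python
from math import prod

def to_rns(x, moduli):
    """Decompose integer x into a tuple of residues, one per modulus."""
    return tuple(x % m for m in moduli)

def from_rns(residues, moduli):
    """Reconstruct x in [0, prod(moduli)) with the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi mod m (Python 3.8+).
        total += r * Mi * pow(Mi, -1, m)
    return total % M

def rns_mac(acc, a, b, moduli):
    """One multiply-accumulate step, done independently per residue channel."""
    return tuple((s + x * y) % m for s, x, y, m in zip(acc, a, b, moduli))

def nested_mac(s, x, y, outer_m, inner_moduli):
    """MAC modulo outer_m, computed through a smaller inner RNS (assumed scheme)."""
    # The inner dynamic range must hold the largest unreduced value s + x*y.
    assert prod(inner_moduli) > (outer_m - 1) ** 2 + (outer_m - 1)
    inner = rns_mac(to_rns(s, inner_moduli), to_rns(x, inner_moduli),
                    to_rns(y, inner_moduli), inner_moduli)
    return from_rns(inner, inner_moduli) % outer_m

# Hypothetical pairwise-coprime moduli; the dynamic range is 7 * 11 * 13 = 1001.
MODULI = (7, 11, 13)
weights = [3, 5, 2, 7]   # illustrative values, not taken from the paper
pixels = [10, 4, 6, 9]

acc = to_rns(0, MODULI)
for w, p in zip(weights, pixels):
    acc = rns_mac(acc, to_rns(w, MODULI), to_rns(p, MODULI), MODULI)

result = from_rns(acc, MODULI)           # equals the ordinary dot product
nested = nested_mac(4, 9, 10, 11, (5, 6, 7))  # equals (4 + 9 * 10) % 11
```

Each channel in `rns_mac` only ever handles values smaller than its own modulus, which is why a wide MAC unit can be split into narrow ones. The channels still differ in width, though, and `nested_mac` hints at how a wider channel could be re-expressed with smaller moduli. The converters realized by BRAMs and DSP blocks in the paper are represented here only by `to_rns` and `from_rns`.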