IEEE Transactions on Very Large Scale Integration (VLSI) Systems

OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks


Abstract

Field-programmable gate arrays (FPGAs) provide rich parallel computing resources with high energy efficiency, making them ideal for accelerating deep convolutional neural networks (CNNs). In recent years, automatic compilers have been developed to generate network-specific FPGA accelerators. However, as increasingly complicated tasks adopt cascades of deep CNN algorithms, reconfiguring the FPGA at runtime becomes unavoidable when network-specific accelerators are employed. Such reconfiguration can be difficult for edge devices. Moreover, a network-specific accelerator requires regenerating the RTL code and redoing the physical implementation whenever the network is updated, which is not easy for CNN end users. In this article, we propose a domain-specific FPGA overlay processor, named OPU, to accelerate CNNs. It offers software-like programmability to CNN end users: CNN algorithms are automatically compiled into executable code, which OPU loads and executes without FPGA reconfiguration when networks are switched or updated. OPU instructions perform complicated functions with variable runtimes but have a uniform length. The instruction granularity is optimized to provide good performance and sufficient flexibility while reducing the complexity of developing the microarchitecture and compiler. Experiments show that OPU achieves an average runtime multiply-and-accumulate unit (MAC) efficiency (RME) of 91% across nine different networks. Moreover, for VGG and YOLO networks, OPU outperforms automatically compiled network-specific accelerators in the literature. In addition, OPU shows 5.35x better power efficiency than the Titan Xp GPU. In a real-time cascaded-CNN scenario, OPU is 2.9x faster than the edge-computing GPU Jetson TX2, which has a similar amount of computing resources.
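The abstract does not spell out how RME is computed. A plausible reading, consistent with how MAC-array utilization is commonly reported (this formula is an illustrative assumption, not a definition taken from the paper), is

$$\mathrm{RME} = \frac{\#\,\text{MAC operations required by the network}}{N_{\mathrm{MAC}} \cdot f_{\mathrm{clk}} \cdot T_{\mathrm{run}}}$$

where $N_{\mathrm{MAC}}$ is the number of physical MAC units, $f_{\mathrm{clk}}$ the clock frequency, and $T_{\mathrm{run}}$ the measured end-to-end runtime. Under this reading, a 91% RME means the MAC array performs useful work in 91% of all available cycles.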
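To make "uniform length but variable runtime" concrete, the minimal Python sketch below models a fixed 32-bit instruction word whose execution time depends on how much data it covers. All field widths, opcodes, and latency numbers here are hypothetical illustrations, not the actual OPU ISA.

```python
# Hypothetical 32-bit instruction layout (illustrative only, not the OPU ISA):
#   [31:28] opcode   [27:16] length field   [15:0] mode/addressing flags

def decode(word: int) -> tuple[int, int, int]:
    """Split a fixed-width 32-bit instruction word into its fields."""
    opcode = (word >> 28) & 0xF       # which operation to run
    length = (word >> 16) & 0xFFF     # how much data the operation covers
    flags = word & 0xFFFF             # remaining mode/addressing bits
    return opcode, length, flags

def cycles(opcode: int, length: int) -> int:
    """Toy latency model: every instruction has the same width,
    but its runtime grows with the workload it describes."""
    setup = {0x1: 2, 0x2: 8}.get(opcode, 1)  # assumed per-op setup cost
    return setup + length

op, n, _ = decode(0x10040000)   # opcode=1, length=4, flags=0
print(cycles(op, n))            # -> 6 cycles for this one 32-bit word
```

A fixed word width keeps instruction fetch and compilation simple, while the data-dependent latency lets a single coarse-grained instruction cover many MAC cycles; this trade-off is one way to read the abstract's "optimized instruction granularity."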
