International Journal of Reconfigurable Computing

An FPGA-Based Hardware Accelerator for CNNs Using On-Chip Memories Only: Design and Benchmarking with Intel Movidius Neural Compute Stick


Abstract

In recent years, convolutional neural networks have been used for a wide range of applications, thanks to their ability to carry out tasks with a reduced number of parameters compared with other deep learning approaches. However, the power consumption and memory footprint constraints typical of edge and portable applications usually conflict with accuracy and latency requirements. For this reason, commercial hardware accelerators have become popular, since their architectures are designed for the inference of general convolutional neural network models. Nevertheless, field-programmable gate arrays represent an interesting alternative, since they offer the possibility to implement a hardware architecture tailored to a specific convolutional neural network model, with promising results in terms of latency and power consumption. In this article, we propose a full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network designed for a keyword spotting application. We started from the model implemented in a previous work for the Intel Movidius Neural Compute Stick. For our goals, we quantized this model through a bit-true simulation and realized a dedicated architecture that uses on-chip memories exclusively. We then benchmarked implementations on different field-programmable gate array families from Xilinx and Intel against the implementation on the Neural Compute Stick. The analysis shows that the FPGA solution achieves better inference time and energy-per-inference results with comparable accuracy, at the expense of a higher design effort and development time.
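The bit-true simulation mentioned above evaluates the network with the same rounding and saturation behavior that the fixed-point FPGA datapath will exhibit. A minimal sketch of such a quantization step is shown below; the Q-format word lengths (`int_bits`, `frac_bits`) are illustrative assumptions, not the actual precisions chosen in the paper.

```python
import numpy as np

def quantize(x, int_bits, frac_bits):
    """Quantize a float array to signed fixed-point Q(int_bits.frac_bits).

    Values are rounded to the nearest representable step and saturated to
    the signed range, mimicking bit-true FPGA arithmetic. The word lengths
    here are illustrative placeholders.
    """
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits + frac_bits - 1))       # smallest signed code
    hi = 2.0 ** (int_bits + frac_bits - 1) - 1      # largest signed code
    codes = np.clip(np.round(x * scale), lo, hi)    # round, then saturate
    return codes / scale                            # back to real values

# Example: 8-bit Q2.6 quantization of a few weights; 2.9 saturates.
w = np.array([0.7, -1.3, 0.031, 2.9])
print(quantize(w, int_bits=2, frac_bits=6))
```

Running the quantized model end to end with such an operator in place of floating-point arithmetic lets the accuracy loss be measured before committing to a hardware word length.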
