ACM Transactions on Reconfigurable Technology and Systems

NEURAghe: Exploiting CPU-FPGA Synergies for Efficient and Flexible CNN Inference Acceleration on Zynq SoCs


Abstract

Deep convolutional neural networks (CNNs) obtain outstanding results in tasks that require human-level understanding of data, like image or speech recognition. However, their computational load is significant, motivating the development of CNN-specialized accelerators. This work presents NEURAghe, a flexible and efficient hardware/software solution for the acceleration of CNNs on Zynq SoCs. NEURAghe leverages the synergistic usage of the Zynq ARM cores and of a powerful and flexible Convolution-Specific Processor deployed on the reconfigurable logic. The Convolution-Specific Processor embeds both a convolution engine and a programmable soft core, releasing the ARM processors from most supervision duties and allowing the accelerator to be controlled by software at an ultra-fine granularity. This methodology opens the way for cooperative heterogeneous computing: while the accelerator takes care of the bulk of the CNN workload, the ARM cores can seamlessly execute hard-to-accelerate parts of the computational graph, taking advantage of the NEON vector engines to further speed up computation. Through the companion NeuDNN software stack, NEURAghe supports end-to-end CNN-based classification with a peak performance of 169 GOps/s and an energy efficiency of 17 GOps/W. Thanks to our heterogeneous computing model, our platform improves upon the state of the art, achieving a frame rate of 5.5 frames per second (fps) on the end-to-end execution of VGG-16 and 6.6 fps on ResNet-18.
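To make the cooperative heterogeneous model concrete, below is a minimal C sketch of the layer-dispatch pattern the abstract describes: bulk convolutions are offloaded to the Convolution-Specific Processor on the reconfigurable logic, while hard-to-accelerate layers run on the ARM cores. All names here (layer_t, fpga_conv_offload, cpu_neon_layer, run_network) are hypothetical illustrations, not the actual NeuDNN API.

    #include <stddef.h>

    typedef enum { LAYER_CONV, LAYER_POOL, LAYER_FC } layer_kind_t;

    typedef struct {
        layer_kind_t kind;
        /* weight and shape metadata elided for brevity */
    } layer_t;

    static void fpga_conv_offload(const layer_t *l, const float *in, float *out)
    {
        /* In the real system this would submit a convolution to the
           Convolution-Specific Processor; its embedded soft core
           supervises execution, so the ARM caller only enqueues the
           job and waits for completion. Stubbed here. */
        (void)l; (void)in; (void)out;
    }

    static void cpu_neon_layer(const layer_t *l, const float *in, float *out)
    {
        /* Hard-to-accelerate layers stay on the ARM cores, vectorized
           with NEON in a full implementation. Stubbed here. */
        (void)l; (void)in; (void)out;
    }

    /* Dispatch each layer of the computational graph to the engine
       that suits it: convolutions to the FPGA, the rest to ARM. */
    void run_network(const layer_t *layers, size_t n,
                     const float *input, float *buf_a, float *buf_b)
    {
        const float *in = input;
        float *out = buf_a;
        for (size_t i = 0; i < n; ++i) {
            if (layers[i].kind == LAYER_CONV)
                fpga_conv_offload(&layers[i], in, out);
            else
                cpu_neon_layer(&layers[i], in, out);
            in = out;                              /* ping-pong buffers */
            out = (out == buf_a) ? buf_b : buf_a;
        }
    }

As a rough sanity check on the reported figures: assuming the 169 GOps/s peak performance and the 17 GOps/W energy efficiency refer to the same operating point, the implied power envelope is about 169 / 17 ≈ 10 W.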
