IEEE Journal on Emerging and Selected Topics in Circuits and Systems

Optimizing Temporal Convolutional Network Inference on FPGA-Based Accelerators



Abstract

Convolutional Neural Networks (CNNs) are extensively used in a wide range of applications, most commonly in computer vision tasks such as image and video classification, recognition, and segmentation. Recent research results demonstrate that multi-layer (deep) networks involving mono-dimensional convolutions and dilation can be used effectively in time-series and sequence classification and segmentation, as well as in other sequence modeling tasks. These structures, commonly referred to as Temporal Convolutional Networks (TCNs), are an extremely promising alternative to the recurrent architectures traditionally used across a broad range of sequence modeling tasks. While FPGA-based inference accelerators for classic CNNs are widespread, the literature lacks a quantitative evaluation of their usability for inference on TCN models. In this paper we present such an evaluation, taking as a reference a CNN accelerator with specific features supporting TCN kernels, and using a set of state-of-the-art TCNs as a benchmark. Experimental results show that, during TCN execution, operational intensity can be critical for overall performance. We propose a convolution scheduling based on batch processing that can boost efficiency up to 96% of theoretical peak performance. Overall we achieve up to 111.8 GOPS and a power efficiency of 33.8 GOPS/W on an Ultrascale+ ZU3EG (up to 10x speedup and 3x power-efficiency improvement with respect to a pure software implementation).
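The building block the abstract refers to, a causal dilated mono-dimensional convolution, can be sketched as follows. This is an illustrative reference implementation of the operation itself, not the accelerator's kernel; the function name and single-channel formulation are assumptions for clarity:

```python
def causal_dilated_conv1d(x, w, dilation=1):
    """Causal 1D convolution with dilation, the core TCN operation.

    Left-pads the input so the output has the same length as x and
    y[t] depends only on x[t] and earlier samples (no future leakage).
    """
    k = len(w)
    pad = (k - 1) * dilation          # receptive-field reach into the past
    xp = [0.0] * pad + list(x)        # zero-pad on the left only
    # w[-1] multiplies the current sample; earlier taps reach back
    # in strides of `dilation`.
    return [sum(w[j] * xp[t + j * dilation] for j in range(k))
            for t in range(len(x))]
```

Stacking such layers with exponentially growing dilation (1, 2, 4, ...) is what gives a TCN a large receptive field with few layers.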
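The claim that operational intensity drives performance, and that batching helps, can be made concrete with a back-of-the-envelope model. The cost model below is an illustrative assumption (16-bit operands, each tensor crossing off-chip memory once, weights amortized across a batch), not the paper's measured figures:

```python
def operational_intensity(t, c_in, c_out, k, batch=1, bytes_per_elem=2):
    """Ops per byte of off-chip traffic for one 1D convolution layer.

    t        : sequence length
    c_in/out : input/output channels
    k        : kernel width
    batch    : number of sequences processed per weight load

    Assumes inputs, outputs, and weights each cross the memory
    boundary once; batching amortizes the weight transfer.
    """
    macs = batch * t * c_out * c_in * k               # multiply-accumulates
    ops = 2 * macs                                    # one MAC = 2 ops
    traffic = bytes_per_elem * (batch * t * (c_in + c_out)  # activations
                                + c_in * c_out * k)         # weights, once
    return ops / traffic
```

Under this model, operational intensity rises monotonically with the batch size, because the fixed weight traffic is spread over more useful work; this is the effect the proposed batch-based convolution scheduling exploits.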
