IEEE International Conference on Artificial Intelligence Circuits and Systems

Energy-Efficient Accelerator Design with 3D-SRAM and Hierarchical Interconnection Architecture for Compact Sparse CNNs



Abstract

Deep learning applications are deployed to resource- and energy-constrained edge devices via compact and sparse CNN models. However, sparsity, feature sizes, and filter shapes vary widely across deep networks, resulting in inefficient resource utilization and data movement. In this paper, an energy-efficient accelerator for compact sparse CNNs is proposed, built around a flexible hierarchical on-chip interconnection architecture, 32 PE tiles, and 3D-SRAM. The 3D-SRAM is used as distributed memory for the PE tiles to hold intermediate data between layers, reducing the energy consumption of off-chip DRAM accesses. Based on the distributed 3D-SRAM, an output-stationary dataflow is adopted, eliminating the movement of partial sums among PEs. The 32 PE tiles are therefore connected through a configurable ring-based unicast global network with micro-routers, which lowers implementation cost compared to a typical router for a mesh network. Each PE tile implements an all-to-all local network to support the widely varying sizes, shapes, and non-zero-activation computations of compact sparse CNNs. Overall, the proposed accelerator achieves 509.8 inference/sec, 1860.5 inference/J, and 383.3 GOPS/W on MobileNetV2, improving energy efficiency by a factor of 1.43x over a dense architecture.
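The output-stationary dataflow mentioned above can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware design: each output element accumulates its partial sum in a single local variable (as a PE would in a local register), so no partial sums move between compute units, and zero-valued activations are skipped, mimicking the non-zero-activation compute of sparse CNNs. The function name and the plain-list representation are illustrative assumptions.

```python
def conv2d_output_stationary(ifmap, weights):
    """Output-stationary 2D convolution (valid padding, stride 1).

    Illustrative sketch only: each output element keeps its partial sum
    in a local accumulator until it is complete, so partial sums never
    move between "PEs". Zero activations are skipped, mimicking the
    non-zero-activation computation of sparse CNNs.
    """
    H, W = len(ifmap), len(ifmap[0])
    R, S = len(weights), len(weights[0])
    out = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for oy in range(H - R + 1):
        for ox in range(W - S + 1):
            acc = 0.0  # partial sum stays local (output stationary)
            for ky in range(R):
                for kx in range(S):
                    a = ifmap[oy + ky][ox + kx]
                    if a != 0:  # sparsity: skip zero activations
                        acc += a * weights[ky][kx]
            out[oy][ox] = acc
    return out
```

In an input- or weight-stationary scheme, partial sums would instead be forwarded between PEs for accumulation; keeping them stationary is what lets the paper's design use a cheap unicast ring rather than a full mesh for the global network.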
