IEEE Transactions on Very Large Scale Integration (VLSI) Systems

SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator



Abstract

Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thus reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain stable load balance under a static scheduling (SS) strategy, which is sensitive to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, achieving at most 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
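The Winograd convolution mentioned in the abstract trades extra additions for fewer multiplications per output tile. As an illustrative sketch only (not the paper's hardware implementation), the following NumPy snippet shows the standard 1-D F(2,3) Winograd algorithm, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; the transform matrices are the commonly used F(2,3) constants.

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (input, filter, output).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Compute y = A^T [(G g) * (B^T d)]: two filter outputs per 4-element tile,
    using an elementwise product (4 multiplies) in the transform domain."""
    U = G @ g          # transformed 3-tap filter, shape (4,)
    V = BT @ d         # transformed 4-element input tile, shape (4,)
    return AT @ (U * V)

# Check against a direct sliding-window filter on one tile.
d = np.array([1.0, 2.0, 3.0, 4.0])   # example input tile
g = np.array([1.0, 0.0, -1.0])       # example filter taps
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f23(d, g), direct)
```

In a sparse-Winograd design like the one described, zeros in the transformed filter `U` (or activations `V`) let the elementwise multiplications be skipped entirely, which is where the sparsity savings compound with the Winograd multiplication reduction.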
