IEEE Transactions on Very Large Scale Integration (VLSI) Systems

SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator



Abstract

Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thus reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain stable load balance under a static scheduling (SS) strategy, which is sensitive to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, achieving at most 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
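The Winograd convolution mentioned in the abstract trades extra additions for fewer multiplications per output tile. As an illustrative sketch only (not the paper's hardware implementation), the following NumPy snippet shows the standard 1-D F(2,3) Winograd algorithm, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; the transform matrices are the commonly used F(2,3) constants.

```python
import numpy as np

# Standard F(2,3) Winograd transform matrices (input, filter, output).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Compute y = A^T [(G g) * (B^T d)]: two filter outputs per 4-element tile,
    using an elementwise product (4 multiplies) in the transform domain."""
    U = G @ g          # transformed 3-tap filter, shape (4,)
    V = BT @ d         # transformed 4-element input tile, shape (4,)
    return AT @ (U * V)

# Check against a direct sliding-window filter on one tile.
d = np.array([1.0, 2.0, 3.0, 4.0])   # example input tile
g = np.array([1.0, 0.0, -1.0])       # example filter taps
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f23(d, g), direct)
```

In a sparse-Winograd design like the one described, zeros in the transformed filter `U` (or activations `V`) let the elementwise multiplications be skipped entirely, which is where the sparsity savings compound with the Winograd multiplication reduction.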
