IEEE Transactions on Parallel and Distributed Systems

FeatherCNN: Fast Inference Computation with TensorGEMM on ARM Architectures

Abstract

Deep learning is ubiquitous in a wide range of applications spanning research and industry. Compared to the time-consuming iterative training of convolutional neural networks (CNNs), inference is a relatively lightweight operation, making it amenable to execution on mobile devices. Nevertheless, lower latency and higher computational efficiency are crucial to allow for complex models and prolonged battery life. Addressing these challenges, we propose FeatherCNN, a fast inference library for ARM CPUs that targets the performance ceiling of mobile devices. FeatherCNN employs three key techniques: 1) a highly efficient TensorGEMM (generalized matrix multiplication) routine accelerates Winograd convolution on ARM CPUs; 2) general layer optimization based on custom high-performance kernels improves both the computational efficiency and the locality of memory access patterns of non-Winograd layers; and 3) the framework design emphasizes joint layer-wise optimization, using layer fusion to remove redundant calculations and memory movements. Performance evaluation reveals that FeatherCNN significantly outperforms state-of-the-art libraries: a forward propagation pass of VGG-16 on a 64-core ARM server is 48, 14, and 12 times faster than Caffe using OpenBLAS, Caffe2 using Eigen, and NNPACK, respectively. In addition, FeatherCNN is 3.19 times faster than the recently released TensorFlow Lite library on an iPhone 7 Plus. In terms of GEMM performance, FeatherCNN achieves 14.8 and 39.0 percent higher performance than Apple's Accelerate framework on an iPhone 7 Plus and Eigen on a Samsung Galaxy S8, respectively. The source code of the FeatherCNN library is publicly available at https://github.com/tencent/feathercnn.
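
To make the first technique concrete: Winograd convolution trades multiplications for additions. In the one-dimensional F(2,3) case, two outputs of a 3-tap filter cost four multiplications instead of the six a direct computation needs. The sketch below is a minimal illustration of this arithmetic only, not FeatherCNN's implementation, which applies 2-D tile transforms and batches the element-wise products into TensorGEMM calls.

    #include <cstdio>

    // Winograd F(2,3): two outputs of a 3-tap convolution with
    // 4 multiplications instead of 6.
    // d: 4 input values, g: 3 filter taps, y: 2 outputs.
    void winograd_f2_3(const float d[4], const float g[3], float y[2]) {
        // Filter transform (precomputable once per filter).
        float u0 = g[0];
        float u1 = 0.5f * (g[0] + g[1] + g[2]);
        float u2 = 0.5f * (g[0] - g[1] + g[2]);
        float u3 = g[2];
        // Input transform.
        float v0 = d[0] - d[2];
        float v1 = d[1] + d[2];
        float v2 = d[2] - d[1];
        float v3 = d[1] - d[3];
        // Element-wise products: the four multiplications.
        float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
        // Output transform.
        y[0] = m0 + m1 + m2;
        y[1] = m1 - m2 - m3;
    }

    int main() {
        float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, -1, 2}, y[2];
        winograd_f2_3(d, g, y);
        printf("winograd: %g %g\n", y[0], y[1]);
        // Direct convolution for comparison.
        printf("direct:   %g %g\n",
               d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
               d[1] * g[0] + d[2] * g[1] + d[3] * g[2]);
        return 0;
    }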
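
The GEMM comparisons against Accelerate and Eigen rest on the standard high-performance building block: a register-blocked micro-kernel performing rank-1 updates on a small output tile. The scalar sketch below shows the idea for a 4x4 tile; the packing layout and names are illustrative assumptions, not the library's API. FeatherCNN's actual kernels keep such tiles in ARM NEON vector registers and are tuned per microarchitecture.

    #include <cstddef>

    // Illustrative 4x4 GEMM micro-kernel: C[4][4] += A(4 x k) * B(k x 4).
    // Assumes A is packed so each step reads 4 contiguous column values,
    // and B so each step reads 4 contiguous row values.
    void micro_kernel_4x4(std::size_t k, const float* a, const float* b,
                          float* c, std::size_t ldc) {
        float acc[4][4] = {};  // 16 accumulators, kept in registers
        for (std::size_t p = 0; p < k; ++p) {
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j)
                    acc[i][j] += a[i] * b[j];  // rank-1 update of the tile
            a += 4;  // next packed column of A
            b += 4;  // next packed row of B
        }
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                c[i * ldc + j] += acc[i][j];
    }

Holding all sixteen accumulators in locals is what allows the compiler (or a hand-written NEON version) to keep the tile entirely in registers; tile shape and loop order are the main per-core tuning knobs.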

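Layer fusion, the third technique, removes intermediate reads and writes by folding element-wise layers into the producing layer's output loop. A minimal sketch with hypothetical function names, contrasting an unfused and a fused bias-plus-ReLU pass:

    #include <algorithm>
    #include <cstddef>

    // Unfused: two full passes over the output tensor, i.e. two
    // round-trips through memory for layers that are typically
    // memory-bound on mobile CPUs.
    void bias_then_relu(float* x, const float* bias,
                        std::size_t channels, std::size_t hw) {
        for (std::size_t c = 0; c < channels; ++c)
            for (std::size_t i = 0; i < hw; ++i)
                x[c * hw + i] += bias[c];
        for (std::size_t i = 0; i < channels * hw; ++i)
            x[i] = std::max(x[i], 0.0f);
    }

    // Fused: one pass, each element is loaded and stored once.
    void bias_relu_fused(float* x, const float* bias,
                         std::size_t channels, std::size_t hw) {
        for (std::size_t c = 0; c < channels; ++c)
            for (std::size_t i = 0; i < hw; ++i)
                x[c * hw + i] = std::max(x[c * hw + i] + bias[c], 0.0f);
    }

For this pair of layers the fused version roughly halves memory traffic, which is the effect the abstract describes as removing redundant calculations and memory movements.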