IEEE Transactions on Computers

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels



Abstract

Modern Convolutional Neural Networks (CNNs) require a massive number of convolution operations. To address this overwhelming computational cost, the Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are transformed into special domains and then multiplied element-wise, a step that can be cast as a batched GEMM operation. The different stages of the computation contain multiple tasks with distinct computation and memory behaviors, and these tasks share intermediate data, which provides the opportunity to fuse them into a monolithic kernel. However, traditional kernel fusion suffers from insufficient shared memory, which limits performance. In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernels. GPU thread blocks are assigned different computation tasks, and we design a mapping algorithm to assign tasks to thread blocks. We also build a scheduler that fetches and executes tasks according to their dependency relationships. Evaluation on modern CNNs shows that our techniques achieve average speedups of 1.25X and 1.7X over cuDNN's two implementations of the Winograd convolution algorithm.
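The scheduling idea described in the abstract can be illustrated with a small sketch. The following Python code is not the authors' implementation; it is a hypothetical simulation in which workers (standing in for GPU thread blocks inside a MegaKernel) repeatedly fetch ready tasks from a queue, and finishing a task unlocks its dependents. The stage names (input transform, filter transform, batched GEMM, inverse transform) follow the Winograd pipeline the abstract outlines.

```python
from collections import deque

class Task:
    """One unit of work in the fused kernel (hypothetical model)."""
    def __init__(self, name, deps):
        self.name = name          # e.g. "input_tf", "gemm", "inverse_tf"
        self.deps = set(deps)     # names of tasks that must finish first
        self.dependents = []      # tasks unlocked when this one finishes

def run_megakernel(tasks):
    """Execute tasks respecting dependencies; return the completion order."""
    by_name = {t.name: t for t in tasks}
    # Wire up reverse edges so finishing a task can notify its dependents.
    for t in tasks:
        for d in t.deps:
            by_name[d].dependents.append(t)
    # Tasks with no unmet dependencies are immediately ready.
    ready = deque(t for t in tasks if not t.deps)
    order = []
    while ready:
        t = ready.popleft()       # a "thread block" fetches a ready task
        order.append(t.name)      # ... and executes it
        for dep in t.dependents:  # completing t may unlock dependents
            dep.deps.discard(t.name)
            if not dep.deps:
                ready.append(dep)
    return order

# Winograd-style pipeline: both transforms feed the batched GEMM,
# whose result feeds the inverse transform.
pipeline = [
    Task("input_tf", []),
    Task("filter_tf", []),
    Task("gemm", ["input_tf", "filter_tf"]),
    Task("inverse_tf", ["gemm"]),
]
order = run_megakernel(pipeline)
```

In the real MegaKernel, the queue and dependency counters live in GPU memory and many thread blocks drain the queue concurrently; this single-threaded sketch only shows the ordering logic.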

