IEEE Transactions on Computers

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels



Abstract

Modern Convolutional Neural Networks (CNNs) require a massive number of convolution operations. To address this overwhelming computational cost, the Winograd and FFT fast algorithms have been used as effective approaches to reduce the number of multiplications. Inputs and filters are transformed into special domains and then multiplied element-wise, a step that can be cast as a batched GEMM operation. The different stages of the computation contain multiple tasks with distinct computation and memory behaviors, and these tasks share intermediate data, which provides the opportunity to fuse them into a monolithic kernel. However, traditional kernel fusion suffers from insufficient shared memory, which limits performance. In this article, we propose a new kernel fusion technique for fast convolution algorithms based on MegaKernels. GPU thread blocks are assigned different computation tasks, and we design a mapping algorithm to assign tasks to thread blocks. We also build a scheduler that fetches and executes tasks according to their dependency relationships. Evaluation on modern CNNs shows that our techniques achieve average speedups of 1.25X and 1.7X over cuDNN's two implementations of the Winograd convolution algorithm.
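The scheduling idea described in the abstract can be illustrated with a small sketch. The following Python code is not the authors' implementation; it is a hypothetical simulation in which workers (standing in for GPU thread blocks inside a MegaKernel) repeatedly fetch ready tasks from a queue, and finishing a task unlocks its dependents. The stage names (input transform, filter transform, batched GEMM, inverse transform) follow the Winograd pipeline the abstract outlines.

```python
from collections import deque

class Task:
    """One unit of work in the fused kernel (hypothetical model)."""
    def __init__(self, name, deps):
        self.name = name          # e.g. "input_tf", "gemm", "inverse_tf"
        self.deps = set(deps)     # names of tasks that must finish first
        self.dependents = []      # tasks unlocked when this one finishes

def run_megakernel(tasks):
    """Execute tasks respecting dependencies; return the completion order."""
    by_name = {t.name: t for t in tasks}
    # Wire up reverse edges so finishing a task can notify its dependents.
    for t in tasks:
        for d in t.deps:
            by_name[d].dependents.append(t)
    # Tasks with no unmet dependencies are immediately ready.
    ready = deque(t for t in tasks if not t.deps)
    order = []
    while ready:
        t = ready.popleft()       # a "thread block" fetches a ready task
        order.append(t.name)      # ... and executes it
        for dep in t.dependents:  # completing t may unlock dependents
            dep.deps.discard(t.name)
            if not dep.deps:
                ready.append(dep)
    return order

# Winograd-style pipeline: both transforms feed the batched GEMM,
# whose result feeds the inverse transform.
pipeline = [
    Task("input_tf", []),
    Task("filter_tf", []),
    Task("gemm", ["input_tf", "filter_tf"]),
    Task("inverse_tf", ["gemm"]),
]
order = run_megakernel(pipeline)
```

In the real MegaKernel, the queue and dependency counters live in GPU memory and many thread blocks drain the queue concurrently; this single-threaded sketch only shows the ordering logic.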

