IEEE International Conference on Cluster Computing

Optimizing GPU Memory Transactions for Convolution Operations

Abstract

Convolution computation is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inferencing. Existing approaches for accelerating convolution operations aim to reduce computational complexity. However, these strategies often increase the memory footprint with extra memory accesses, leaving much room for performance improvement. This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution. Our approach leverages two optimization techniques to reduce the number of memory operations for convolutions performed along the width and height dimensions. For convolution computations along the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input, reducing the number of memory transactions. For convolution operations along the height dimension, we multiply each overlapped row of the input with multiple rows of a filter, computing multiple output elements at once to improve the data locality of row elements. We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over $\mathbf{2}\times$ faster performance than state-of-the-art image processing libraries. For multi-channel 2D convolutions, we obtain up to $\mathbf{1.3}\times$ speedups over the quickest algorithm of cuDNN.
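The abstract describes the two memory-access optimizations but reproduces no code. As a concrete illustration, below is a minimal CUDA sketch of the width-dimension technique, written solely from the description above; the kernel name conv1d_shuffle, the 3-tap filter, and the warp-boundary fallback are assumptions for the sketch, not the authors' implementation. Each lane performs a single coalesced load and obtains the two overlapped columns from neighbouring lanes' registers via shuffle instructions, replacing redundant global-memory transactions with register exchanges.

```cuda
// Minimal sketch of the width-dimension idea (illustrative, not the
// authors' code): a 1-D horizontal convolution with an assumed filter
// width of 3. Each lane issues one coalesced load; the two overlapped
// neighbouring columns come from other lanes' registers via shuffles.
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu

__global__ void conv1d_shuffle(const float* __restrict__ in,
                               const float* __restrict__ filt,  // 3 taps
                               float* __restrict__ out, int n)
{
    int x    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    // One global-memory load per lane instead of three.
    float v0 = (x < n) ? in[x] : 0.0f;

    // Fetch in[x+1] and in[x+2] from lanes lane+1 and lane+2.
    float v1 = __shfl_down_sync(FULL_MASK, v0, 1);
    float v2 = __shfl_down_sync(FULL_MASK, v0, 2);

    // The top lanes of a warp cannot shuffle across the warp boundary;
    // this sketch simply reloads for them (a production kernel would
    // stage a halo element per warp instead).
    if (lane >= 30) {
        v1 = (x + 1 < n) ? in[x + 1] : 0.0f;
        v2 = (x + 2 < n) ? in[x + 2] : 0.0f;
    }

    if (x + 2 < n)  // valid outputs only
        out[x] = filt[0] * v0 + filt[1] * v1 + filt[2] * v2;
}
```

A similarly hedged sketch of the height-dimension technique, assuming a 3-tap column filter with each thread producing three vertically adjacent outputs: every loaded input row is multiplied by all filter rows whose windows overlap it, so each row is read from global memory once and reused up to three times from registers.

```cuda
// Sketch of the height-dimension idea under the same assumptions: a
// vertical (column) convolution where one thread accumulates three
// output rows at once. The five input rows y..y+4 are each loaded a
// single time; `out` is assumed to hold height-2 valid rows of `width`.
__global__ void conv_rows_reuse(const float* __restrict__ in,
                                const float* __restrict__ filt,  // 3 taps
                                float* __restrict__ out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = (blockIdx.y * blockDim.y + threadIdx.y) * 3;  // 3 outputs/thread
    if (x >= width || y + 4 >= height) return;

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f;
    for (int r = 0; r < 5; ++r) {
        float v = in[(y + r) * width + x];   // one load per input row
        if (r < 3)           acc0 += filt[r]     * v;  // output row y
        if (r >= 1 && r < 4) acc1 += filt[r - 1] * v;  // output row y+1
        if (r >= 2)          acc2 += filt[r - 2] * v;  // output row y+2
    }
    out[y * width + x]       = acc0;
    out[(y + 1) * width + x] = acc1;
    out[(y + 2) * width + x] = acc2;
}
```

Launching either kernel with a block size that is a multiple of 32 keeps whole warps cooperating in the shuffles. The paper's reported speedups ($\mathbf{2}\times$ over image-processing libraries, $\mathbf{1.3}\times$ over cuDNN's quickest algorithm) come from its full 2D and multi-channel kernels, not from these sketches.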
