IEEE International Conference on Cluster Computing

Optimizing GPU Memory Transactions for Convolution Operations

Abstract

Convolution computation is a common operation in deep neural networks (DNNs) and is often responsible for performance bottlenecks during training and inferencing. Existing approaches for accelerating convolution operations aim to reduce computational complexity. However, these strategies often increase the memory footprint with extra memory accesses, leaving much room for performance improvement. This paper presents a novel approach to optimizing memory access for convolution operations, specifically targeting GPU execution. Our approach leverages two optimization techniques to reduce the number of memory operations for convolutions performed along the width and height dimensions. For convolution computations along the width dimension, we exploit shuffle instructions to exchange the overlapped columns of the input, reducing the number of memory transactions. For convolution operations along the height dimension, we multiply each overlapped row of the input with multiple rows of a filter, computing multiple output elements at once to improve the data locality of row elements. We apply our approach to 2D and multi-channel 2D convolutions on an NVIDIA 2080Ti GPU. For 2D convolution, our approach delivers over $\mathbf{2}\times$ faster performance than state-of-the-art image processing libraries. For multi-channel 2D convolutions, we obtain up to $\mathbf{1.3}\times$ speedups over the quickest algorithm of cuDNN.
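The abstract describes the two memory-access optimizations but reproduces no code. As a concrete illustration, below is a minimal CUDA sketch of the width-dimension technique, written solely from the description above; the kernel name conv1d_shuffle, the 3-tap filter, and the warp-boundary fallback are assumptions for the sketch, not the authors' implementation. Each lane performs a single coalesced load and obtains the two overlapped columns from neighbouring lanes' registers via shuffle instructions, replacing redundant global-memory transactions with register exchanges.

```cuda
// Minimal sketch of the width-dimension idea (illustrative, not the
// authors' code): a 1-D horizontal convolution with an assumed filter
// width of 3. Each lane issues one coalesced load; the two overlapped
// neighbouring columns come from other lanes' registers via shuffles.
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu

__global__ void conv1d_shuffle(const float* __restrict__ in,
                               const float* __restrict__ filt,  // 3 taps
                               float* __restrict__ out, int n)
{
    int x    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    // One global-memory load per lane instead of three.
    float v0 = (x < n) ? in[x] : 0.0f;

    // Fetch in[x+1] and in[x+2] from lanes lane+1 and lane+2.
    float v1 = __shfl_down_sync(FULL_MASK, v0, 1);
    float v2 = __shfl_down_sync(FULL_MASK, v0, 2);

    // The top lanes of a warp cannot shuffle across the warp boundary;
    // this sketch simply reloads for them (a production kernel would
    // stage a halo element per warp instead).
    if (lane >= 30) {
        v1 = (x + 1 < n) ? in[x + 1] : 0.0f;
        v2 = (x + 2 < n) ? in[x + 2] : 0.0f;
    }

    if (x + 2 < n)  // valid outputs only
        out[x] = filt[0] * v0 + filt[1] * v1 + filt[2] * v2;
}
```

A similarly hedged sketch of the height-dimension technique, assuming a 3-tap column filter with each thread producing three vertically adjacent outputs: every loaded input row is multiplied by all filter rows whose windows overlap it, so each row is read from global memory once and reused up to three times from registers.

```cuda
// Sketch of the height-dimension idea under the same assumptions: a
// vertical (column) convolution where one thread accumulates three
// output rows at once. The five input rows y..y+4 are each loaded a
// single time; `out` is assumed to hold height-2 valid rows of `width`.
__global__ void conv_rows_reuse(const float* __restrict__ in,
                                const float* __restrict__ filt,  // 3 taps
                                float* __restrict__ out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = (blockIdx.y * blockDim.y + threadIdx.y) * 3;  // 3 outputs/thread
    if (x >= width || y + 4 >= height) return;

    float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f;
    for (int r = 0; r < 5; ++r) {
        float v = in[(y + r) * width + x];   // one load per input row
        if (r < 3)           acc0 += filt[r]     * v;  // output row y
        if (r >= 1 && r < 4) acc1 += filt[r - 1] * v;  // output row y+1
        if (r >= 2)          acc2 += filt[r - 2] * v;  // output row y+2
    }
    out[y * width + x]       = acc0;
    out[(y + 1) * width + x] = acc1;
    out[(y + 2) * width + x] = acc2;
}
```

Launching either kernel with a block size that is a multiple of 32 keeps whole warps cooperating in the shuffles. The paper's reported speedups ($\mathbf{2}\times$ over image-processing libraries, $\mathbf{1.3}\times$ over cuDNN's quickest algorithm) come from its full 2D and multi-channel kernels, not from these sketches.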
