IEEE International Symposium on High Performance Computer Architecture

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls



Abstract

With advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it has become difficult for a single GPU kernel to fully utilize them. One solution for improving resource utilization is concurrent kernel execution (CKE). Early CKE schemes mainly target leftover resources; they neither optimize resource utilization nor provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Intra-SM sharing has therefore been proposed, which issues thread blocks from different kernels to the same SM. However, as shown in this study, intra-SM sharing schemes can undermine overall performance due to severe interference among kernels. Specifically, since concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact the other kernels and further hurt overall performance. In this study, we investigate approaches to overcoming these problems in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: the first balances the memory accesses of concurrent kernels; the second limits the number of inflight memory instructions issued by each kernel. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
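The second approach, limiting the number of inflight memory instructions per kernel, can be illustrated with a toy timing model. This is only an illustrative sketch, not the paper's hardware scheme: the slot count, memory latency, and per-kernel issue probabilities below are invented for illustration. Two kernels contend for a shared pool of memory-pipeline slots, and a per-kernel cap on outstanding requests keeps the memory-intensive kernel from starving its compute-intensive co-runner.

```python
import random

def simulate(cycles, pipeline_slots, per_kernel_cap, mem_latency=100, seed=0):
    """Toy model of two kernels sharing one SM's memory pipeline.

    Kernel 0 is memory-intensive; kernel 1 is compute-intensive and only
    occasionally needs a memory instruction. `per_kernel_cap` bounds each
    kernel's outstanding memory requests (None = plain intra-SM sharing).
    Returns (issued, stalled) counts per kernel.
    """
    rng = random.Random(seed)
    inflight = [[], []]            # completion times of outstanding requests
    issued, stalled = [0, 0], [0, 0]
    mem_prob = [0.9, 0.1]          # chance per cycle a memory op is ready
    for t in range(cycles):
        # Retire requests whose latency has elapsed.
        inflight = [[c for c in ks if c > t] for ks in inflight]
        for k in (0, 1):           # kernel 0 issues first, modeling hogging
            if rng.random() < mem_prob[k]:
                total = len(inflight[0]) + len(inflight[1])
                under_cap = (per_kernel_cap is None
                             or len(inflight[k]) < per_kernel_cap)
                if total < pipeline_slots and under_cap:
                    inflight[k].append(t + mem_latency)
                    issued[k] += 1
                else:
                    stalled[k] += 1   # memory pipeline stall
    return issued, stalled

# Plain sharing vs. a per-kernel cap (24 of 32 slots).
issued_u, stalled_u = simulate(20000, 32, per_kernel_cap=None)
issued_c, stalled_c = simulate(20000, 32, per_kernel_cap=24)
```

In this toy setting, the cap reserves slack in the memory pipeline, so the compute-intensive kernel (index 1) issues more memory instructions and stalls far less often, mirroring the abstract's point that throttling one kernel's inflight memory instructions reduces cross-kernel interference.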
