IEEE International Symposium on High Performance Computer Architecture

Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls



Abstract

With advances in technology scaling, graphics processing units (GPUs) incorporate an increasing amount of computing resources, and it has become difficult for a single GPU kernel to fully utilize them. One solution for improving resource utilization is concurrent kernel execution (CKE). Early CKE schemes mainly target leftover resources; they neither optimize resource utilization nor provide fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Intra-SM sharing has therefore been proposed, which issues thread blocks from different kernels to the same SM. However, as shown in this study, intra-SM sharing schemes can undermine overall performance due to severe interference among kernels. Specifically, since concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact the other kernels and further hurt overall performance. In this study, we investigate approaches to overcoming these problems in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls: the first balances the memory accesses of concurrent kernels; the second limits the number of inflight memory instructions issued by each kernel. Our evaluation shows that the proposed schemes significantly improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
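The second approach, limiting the number of inflight memory instructions per kernel, can be illustrated with a toy timing model. This is only an illustrative sketch, not the paper's hardware scheme: the slot count, memory latency, and per-kernel issue probabilities below are invented for illustration. Two kernels contend for a shared pool of memory-pipeline slots, and a per-kernel cap on outstanding requests keeps the memory-intensive kernel from starving its compute-intensive co-runner.

```python
import random

def simulate(cycles, pipeline_slots, per_kernel_cap, mem_latency=100, seed=0):
    """Toy model of two kernels sharing one SM's memory pipeline.

    Kernel 0 is memory-intensive; kernel 1 is compute-intensive and only
    occasionally needs a memory instruction. `per_kernel_cap` bounds each
    kernel's outstanding memory requests (None = plain intra-SM sharing).
    Returns (issued, stalled) counts per kernel.
    """
    rng = random.Random(seed)
    inflight = [[], []]            # completion times of outstanding requests
    issued, stalled = [0, 0], [0, 0]
    mem_prob = [0.9, 0.1]          # chance per cycle a memory op is ready
    for t in range(cycles):
        # Retire requests whose latency has elapsed.
        inflight = [[c for c in ks if c > t] for ks in inflight]
        for k in (0, 1):           # kernel 0 issues first, modeling hogging
            if rng.random() < mem_prob[k]:
                total = len(inflight[0]) + len(inflight[1])
                under_cap = (per_kernel_cap is None
                             or len(inflight[k]) < per_kernel_cap)
                if total < pipeline_slots and under_cap:
                    inflight[k].append(t + mem_latency)
                    issued[k] += 1
                else:
                    stalled[k] += 1   # memory pipeline stall
    return issued, stalled

# Plain sharing vs. a per-kernel cap (24 of 32 slots).
issued_u, stalled_u = simulate(20000, 32, per_kernel_cap=None)
issued_c, stalled_c = simulate(20000, 32, per_kernel_cap=24)
```

In this toy setting, the cap reserves slack in the memory pipeline, so the compute-intensive kernel (index 1) issues more memory instructions and stalls far less often, mirroring the abstract's point that throttling one kernel's inflight memory instructions reduces cross-kernel interference.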
