Source: Computer architecture news

Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming

Abstract

As technology scales, GPUs are forecast to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general-purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general-purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q, introduced in the Kepler architecture, enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels, but the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that no single intra-SM slicing strategy delivers the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method to calculate the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly, enabling dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
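
To make the partitioning idea in the abstract concrete, here is a minimal sketch, not the paper's actual model. It assumes each kernel has a profiled performance curve (performance versus number of thread blocks on one SM), a single scalar per-thread-block resource cost, and an objective of maximizing the sum of normalized performance. The function name best_partition, the cost model, the objective, and all numbers are hypothetical, and the sketch ignores the inter-kernel interference that the abstract says the real model accounts for.

```python
from itertools import product

def best_partition(curves, costs, capacity):
    """Exhaustively pick thread-block counts per kernel (illustrative only).

    curves[k][n-1]: profiled performance of kernel k with n thread blocks
                    on one SM (hypothetical profile data).
    costs[k]:       assumed resource cost of one thread block of kernel k,
                    collapsed into a single scalar resource.
    capacity:       assumed total amount of that resource on one SM.
    """
    best_score, best_split = float("-inf"), None
    # Every kernel gets at least one thread block, up to its profiled maximum.
    ranges = [range(1, len(curve) + 1) for curve in curves]
    for split in product(*ranges):
        used = sum(n * cost for n, cost in zip(split, costs))
        if used > capacity:
            continue  # this split does not fit on the SM
        # Normalize each kernel by its own best profiled performance.
        score = sum(curve[n - 1] / max(curve) for curve, n in zip(curves, split))
        if score > best_score:
            best_score, best_split = score, split
    return best_split, best_score

# Hypothetical profiles: kernel A keeps scaling, kernel B saturates early.
curve_a = [1.0, 2.0, 3.0, 4.0]
curve_b = [2.0, 3.0, 3.2, 3.2]
split, score = best_partition([curve_a, curve_b], costs=[2, 1], capacity=8)
print(split, round(score, 4))
```

With these made-up profiles, the search gives the scaling kernel three thread blocks and the saturating kernel two, because extra blocks for the saturating kernel add little. Exhaustive search is used here only for clarity; the abstract describes an analytical method that is computationally efficient enough for online decisions.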