Journal: IEEE Transactions on Parallel and Distributed Systems

cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs


Abstract

Although GPUs are now ubiquitous in scientific and technical computing, they continue to evolve as processors. An important recent feature is the ability to execute multiple kernels concurrently via queue streams. However, experiments show that several parameters, including the behavior of the kernels, the order of kernel launches, and other execution configurations such as the number of concurrent thread blocks, can change the execution time of concurrently executing kernels. Because kernels have different resource requirements, they can be grouped into classes, traditionally either memory-bound or compute-bound. However, the same kernel may fall into different classes on different hardware, depending on the available hardware resources. This paper introduces the definition of kernel mix intensity. Based on it, a scheduling framework called concurrent CUDA (cCUDA) is proposed to co-schedule concurrent kernels more efficiently. cCUDA first profiles and ranks kernels with different execution behaviors, then takes the kernels' resource requirements into account to partition the thread blocks of different kernels and overlap them so that GPU resources are better utilized. Experimental results on real hardware demonstrate execution-time improvements of up to 1.86x and an average speedup of 1.28x for a wide range of kernels. cCUDA is available at https://github.com/kshekofteh/cCUDA.
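
The observation that one kernel can be compute-bound on one GPU and memory-bound on another can be illustrated with the standard roofline model. The sketch below is not the paper's kernel mix intensity metric, whose definition the abstract does not give; it only compares a kernel's arithmetic intensity (FLOPs per byte moved) with each device's machine balance (peak FLOP/s divided by peak memory bandwidth). The `Gpu` class, the `classify` helper, the two devices, and all numbers are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical device description; the numbers used below are illustrative only."""
    name: str
    peak_gflops: float  # peak compute throughput in GFLOP/s
    peak_gbps: float    # peak memory bandwidth in GB/s

    @property
    def balance(self) -> float:
        # Machine balance in FLOPs per byte (the roofline ridge point).
        return self.peak_gflops / self.peak_gbps


def classify(flops: float, bytes_moved: float, gpu: Gpu) -> str:
    """Roofline-style classification (an assumption, not the cCUDA metric)."""
    intensity = flops / bytes_moved  # arithmetic intensity in FLOPs per byte
    return "compute-bound" if intensity >= gpu.balance else "memory-bound"


# Two made-up devices with different compute-to-bandwidth ratios.
gpu_a = Gpu("device-A", peak_gflops=4000.0, peak_gbps=400.0)   # balance = 10 FLOPs/byte
gpu_b = Gpu("device-B", peak_gflops=12000.0, peak_gbps=480.0)  # balance = 25 FLOPs/byte

# One made-up kernel: 16 GFLOP of work over 1 GB of memory traffic (16 FLOPs/byte).
kernel_flops = 16e9
kernel_bytes = 1e9

for gpu in (gpu_a, gpu_b):
    print(gpu.name, classify(kernel_flops, kernel_bytes, gpu))
```

Under these assumed numbers the kernel's intensity of 16 FLOPs/byte lies above device-A's balance and below device-B's, so the same kernel changes class between the two devices. This is the kind of hardware dependence that motivates profiling kernels on the target GPU before co-scheduling them.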
