Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite the recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices and to fully utilize the GPU resources. We develop a novel and effective Markov chain based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
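The slicing idea can be illustrated with a minimal host-side sketch. It assumes a kernel is a one-dimensional grid of thread blocks and that a slice is a contiguous range of those blocks; the names `slice_grid` and `interleave` are hypothetical, and the round-robin ordering is only a stand-in for the Markov chain based scheduling decision, which is not specified here.

```python
from typing import List, Tuple

Slice = Tuple[int, int]  # (block_offset, block_count)


def slice_grid(num_blocks: int, slice_size: int) -> List[Slice]:
    """Split a grid of `num_blocks` thread blocks into contiguous slices.

    Each slice covers at most `slice_size` blocks; the last one may be smaller.
    """
    if num_blocks < 0 or slice_size <= 0:
        raise ValueError("num_blocks must be >= 0 and slice_size must be > 0")
    return [
        (offset, min(slice_size, num_blocks - offset))
        for offset in range(0, num_blocks, slice_size)
    ]


def interleave(a: List[Slice], b: List[Slice]) -> List[Tuple[str, Slice]]:
    """Alternate slices of two kernels so they could be co-scheduled.

    Assumption: plain round-robin, used here only for illustration.
    """
    order: List[Tuple[str, Slice]] = []
    for i in range(max(len(a), len(b))):
        if i < len(a):
            order.append(("A", a[i]))
        if i < len(b):
            order.append(("B", b[i]))
    return order


if __name__ == "__main__":
    kernel_a = slice_grid(10, 4)  # 10 blocks, slices of up to 4 blocks
    kernel_b = slice_grid(6, 3)   # 6 blocks, slices of up to 3 blocks
    for name, (offset, count) in interleave(kernel_a, kernel_b):
        print(f"launch kernel {name}: blocks [{offset}, {offset + count})")
```

In this sketch the slice size is the tunable knob: smaller slices leave more room to pair a slice with one from another kernel, at the cost of more launches.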