IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches



Abstract

Fused CPU-GPU architectures integrate a CPU and a general-purpose GPU on a single die. Recent fused architectures even share the last level cache (LLC) between CPU and GPU, which enables hardware-supported byte-level coherency. Thus, CPU and GPU can execute computational kernels collaboratively, but novel methods to co-schedule work are required. This paper contributes three dynamic co-scheduling methods. Two of our methods implement workers that autonomously acquire work from a common set of independent work items (similar to bag-of-tasks scheduling). The third method, host-side profiling, uses a fraction of a kernel's total work to determine, via profiling, a ratio for distributing work between CPU and GPU. The resulting ratio is used for subsequent executions of the same kernel. Our methods are realized using OpenCL 2.0, which introduces fine-grained shared virtual memory (SVM) to allocate coherent memory between CPU and GPU. We port the Rodinia Benchmark Suite, a standard suite for heterogeneous computing, to fine-grained SVM and fused CPU-GPU architectures (Rodinia-SVM). We evaluate the overhead of fine-grained SVM and analyze the suitability of OpenCL 2.0's new features for co-scheduling. Our host-side profiling method performs competitively with the optimal choice of executing kernels either on CPU or GPU (hypothetical xor-Oracle). On average, it achieves 97% of xor-Oracle's performance and a 1.43× speedup over using the GPU alone (the standard in Rodinia). We show, however, that in most cases it is not beneficial to split the work of a kernel between CPU and GPU, compared to exclusively running it on the most suitable single compute device. For a fixed amount of work per device, cache-related stalls can increase by up to 1.75× when both devices are used in parallel instead of exclusively, while cache misses remain the same. Thus, not the cost of cache conflicts, but inefficient cache coherence is the major performance bottleneck for current fused CPU-GPU Intel architectures with shared LLC.
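The host-side profiling idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a small profiled slice of the kernel's work items has already been timed on each device, and derives the CPU/GPU split for the remaining items from the measured throughputs. The function name and parameters are hypothetical.

```python
def profile_split(total_items, profile_fraction, time_cpu, time_gpu):
    """Return (cpu_items, gpu_items) for the remaining work.

    time_cpu / time_gpu: seconds each device took to process its
    profiled slice of profile_fraction * total_items work items.
    """
    # Size of the profiled slice run on each device.
    sample = int(total_items * profile_fraction)
    # Throughput in work items per second, measured on the profiled slice.
    tput_cpu = sample / time_cpu
    tput_gpu = sample / time_gpu
    # Distribute the remaining items proportionally to throughput, so both
    # devices are expected to finish at roughly the same time.
    remaining = total_items - 2 * sample
    cpu_share = tput_cpu / (tput_cpu + tput_gpu)
    cpu_items = round(remaining * cpu_share)
    return cpu_items, remaining - cpu_items

# If the GPU processed its slice twice as fast as the CPU, it receives
# two thirds of the remaining work.
print(profile_split(10000, 0.05, time_cpu=0.02, time_gpu=0.01))
```

Because the ratio is cached for subsequent launches of the same kernel, the one-time profiling cost is amortized over repeated executions.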
