IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches



Abstract

Fused CPU-GPU architectures integrate a CPU and a general-purpose GPU on a single die. Recent fused architectures even share the last level cache (LLC) between CPU and GPU, which enables hardware-supported byte-level coherency. Thus, CPU and GPU can execute computational kernels collaboratively, but novel methods to co-schedule work are required. This paper contributes three dynamic co-scheduling methods. Two of our methods implement workers that autonomously acquire work from a common set of independent work items (similar to bag-of-tasks scheduling). The third method, host-side profiling, uses a fraction of a kernel's total work to determine, via profiling, a ratio for distributing work between CPU and GPU. The resulting ratio is used for subsequent executions of the same kernel. Our methods are realized using OpenCL 2.0, which introduces fine-grained shared virtual memory (SVM) to allocate coherent memory between CPU and GPU. We port the Rodinia Benchmark Suite, a standard suite for heterogeneous computing, to fine-grained SVM and fused CPU-GPU architectures (Rodinia-SVM). We evaluate the overhead of fine-grained SVM and analyze the suitability of OpenCL 2.0's new features for co-scheduling. Our host-side profiling method performs competitively with the optimal choice of executing kernels either on CPU or GPU (hypothetical xor-Oracle). On average, it achieves 97% of xor-Oracle's performance and a 1.43× speedup over using the GPU alone (the standard in Rodinia). We show, however, that in most cases it is not beneficial to split the work of a kernel between CPU and GPU, compared to exclusively running it on the most suitable single compute device. For a fixed amount of work per device, cache-related stalls can increase by up to 1.75× when both devices are used in parallel instead of exclusively, while cache misses remain the same. Thus, not the cost of cache conflicts, but inefficient cache coherence is the major performance bottleneck for current fused CPU-GPU Intel architectures with shared LLC.
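The host-side profiling idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a small profiled slice of the kernel's work items has already been timed on each device, and derives the CPU/GPU split for the remaining items from the measured throughputs. The function name and parameters are hypothetical.

```python
def profile_split(total_items, profile_fraction, time_cpu, time_gpu):
    """Return (cpu_items, gpu_items) for the remaining work.

    time_cpu / time_gpu: seconds each device took to process its
    profiled slice of profile_fraction * total_items work items.
    """
    # Size of the profiled slice run on each device.
    sample = int(total_items * profile_fraction)
    # Throughput in work items per second, measured on the profiled slice.
    tput_cpu = sample / time_cpu
    tput_gpu = sample / time_gpu
    # Distribute the remaining items proportionally to throughput, so both
    # devices are expected to finish at roughly the same time.
    remaining = total_items - 2 * sample
    cpu_share = tput_cpu / (tput_cpu + tput_gpu)
    cpu_items = round(remaining * cpu_share)
    return cpu_items, remaining - cpu_items

# If the GPU processed its slice twice as fast as the CPU, it receives
# two thirds of the remaining work.
print(profile_split(10000, 0.05, time_cpu=0.02, time_gpu=0.01))
```

Because the ratio is cached for subsequent launches of the same kernel, the one-time profiling cost is amortized over repeated executions.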
