
Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution



Abstract

Contemporary GPUs support multiple kernels running concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel execution (CKE) improves both resource utilization and computational throughput. Most prior work focuses on partitioning GPU resources at the cooperative thread array (CTA) level or the warp-scheduler level to improve CKE. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. The reason is that bandwidth over-subscription by bandwidth-intensive kernels severely aggravates memory access latency, which is highly detrimental to latency-sensitive kernels. Even among bandwidth-intensive kernels, the more intensive ones may unfairly consume much more bandwidth than the less intensive ones.
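The slowdown mechanism the abstract describes can be sketched with a toy contention model (not from the paper; the latency function and all constants are hypothetical, chosen only to illustrate why capping a bandwidth-intensive co-runner's share protects a latency-sensitive kernel):

```python
# Toy model (not from the paper): shows how uncontrolled bandwidth sharing
# inflates memory latency for a latency-sensitive kernel, and how capping
# the bandwidth-intensive co-runner's share restores it. All numbers are
# hypothetical illustration values.

def memory_latency(base_latency, demand, capacity):
    """Simple contention model: latency inflates as aggregate bandwidth
    demand approaches the memory system's capacity (queueing-style 1/(1-u))."""
    utilization = min(demand / capacity, 0.99)  # clamp at near-saturation
    return base_latency / (1.0 - utilization)

CAPACITY = 100.0   # aggregate DRAM bandwidth (arbitrary units)
BASE_LAT = 1.0     # uncontended memory latency (arbitrary units)

# Co-running kernels: one with modest demand (latency-sensitive) and one
# that over-subscribes the memory system (bandwidth-intensive).
latency_sensitive_demand = 10.0
bandwidth_intensive_demand = 120.0

# Uncoordinated sharing: both kernels see latency inflated by total demand.
shared = memory_latency(
    BASE_LAT, latency_sensitive_demand + bandwidth_intensive_demand, CAPACITY)

# Bandwidth partitioning: cap the intensive kernel so total demand stays
# below capacity, protecting the latency-sensitive kernel's accesses.
cap = 70.0
partitioned = memory_latency(
    BASE_LAT,
    latency_sensitive_demand + min(bandwidth_intensive_demand, cap),
    CAPACITY)

print(f"latency without partitioning: {shared:.1f}")
print(f"latency with partitioning:    {partitioned:.1f}")
```

In this toy setting, total demand of 130 against a capacity of 100 saturates the memory system and inflates latency by two orders of magnitude, while the capped co-run (80 of 100) keeps inflation to 5x; the real paper coordinates this partitioning with CTA combination rather than using a fixed cap.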
