Home > Foreign Journals > Computer Architecture News > Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit



Abstract

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction-multiple-thread stacks, or by a shortage of on-chip memory (the capacity limit), such as the register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context-switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
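The active/inactive swapping policy described in the abstract can be illustrated with a minimal Python sketch. The CTA identifiers, the limits chosen, and the `vt_swap` helper are illustrative assumptions for exposition, not part of the paper's actual hardware design:

```python
from collections import deque

def vt_swap(active, ready, stalled):
    """Context-switch a fully stalled active CTA.

    Because both active and inactive CTAs already reside within the
    on-chip capacity limit, the "switch" only reassigns scheduling
    structures (program counters, SIMT stacks); no register-file or
    shared-memory state needs to be saved off-chip.
    """
    active.remove(stalled)              # release the CTA's scheduling slot
    if ready:
        active.append(ready.popleft())  # promote the next ready inactive CTA
    ready.append(stalled)               # stalled CTA waits inactive until its stall resolves
    return active, ready

# Example: a scheduling limit of 2 active CTAs and a capacity limit
# of 4 resident CTAs (both values hypothetical).
active = [0, 1]          # CTAs currently holding scheduling structures
ready = deque([2, 3])    # CTAs resident on-chip but inactive
active, ready = vt_swap(active, ready, stalled=0)
print(active, list(ready))  # [1, 2] [3, 0]
```

Because the swap touches only the small per-CTA scheduling state, its cost is far lower than a conventional context switch that would spill registers and shared memory, which is the source of VT's reduced swapping penalty.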
