Home > Foreign Journals > Computer Architecture News > Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit



Abstract

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction-multiple-thread stacks, or by a shortage of on-chip memory (the capacity limit), such as the register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture which assigns Cooperative Thread Arrays (CTAs) up to the capacity limit, while ignoring the scheduling limit. However, to reduce the logic complexity of managing more threads concurrently, we propose to place CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context-switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
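The active/inactive swapping policy described in the abstract can be illustrated with a minimal Python sketch. The CTA identifiers, the limits chosen, and the `vt_swap` helper are illustrative assumptions for exposition, not part of the paper's actual hardware design:

```python
from collections import deque

def vt_swap(active, ready, stalled):
    """Context-switch a fully stalled active CTA.

    Because both active and inactive CTAs already reside within the
    on-chip capacity limit, the "switch" only reassigns scheduling
    structures (program counters, SIMT stacks); no register-file or
    shared-memory state needs to be saved off-chip.
    """
    active.remove(stalled)              # release the CTA's scheduling slot
    if ready:
        active.append(ready.popleft())  # promote the next ready inactive CTA
    ready.append(stalled)               # stalled CTA waits inactive until its stall resolves
    return active, ready

# Example: a scheduling limit of 2 active CTAs and a capacity limit
# of 4 resident CTAs (both values hypothetical).
active = [0, 1]          # CTAs currently holding scheduling structures
ready = deque([2, 3])    # CTAs resident on-chip but inactive
active, ready = vt_swap(active, ready, stalled=0)
print(active, list(ready))  # [1, 2] [3, 0]
```

Because the swap touches only the small per-CTA scheduling state, its cost is far lower than a conventional context switch that would spill registers and shared memory, which is the source of VT's reduced swapping penalty.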
