CRAT: Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs

Xie Xiaolong; Liang Yun; Li Xiuhong; Wu Yudong; Sun Guangyu; Wang Tao; Fan Dongrui


Abstract

The key to high performance on GPUs lies in massive multithreading, which enables fast thread switching and hides long latencies. GPUs are equipped with a large register file to enable fast context switching. However, thread throttling techniques designed to mitigate cache contention lead to under-utilization of registers. Register allocation is a significant factor for performance because it not only determines single-thread performance but also indirectly affects thread-level parallelism (TLP). In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT) to explore the optimization space of register allocation and TLP management on GPUs. CRAT employs both compile-time (CRAT-static) and run-time (CRAT-dyn) techniques to explore the design space. CRAT-static works statically to explore the trade-off between TLP and register allocation, and CRAT-dyn exploits dynamic register allocation for further improvement. Experiments indicate that CRAT-static achieves an average 1.25X speedup over an existing TLP management technique. On four register-limited applications, CRAT-dyn further improves the speedup from the 1.51X achieved by CRAT-static to 1.70X.
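The coupling between register allocation and TLP that the abstract describes can be illustrated with a small occupancy estimate. This is only a sketch under stated assumptions, not CRAT's algorithm: the function name `max_resident_threads`, the register file size of 65536 registers, the hardware limit of 2048 threads, and the block size of 256 are all illustrative values chosen here, and the model ignores allocation granularity and other limits such as shared memory.

```python
def max_resident_threads(regs_per_thread,
                         regfile_size=65536,    # assumed registers per multiprocessor
                         hw_thread_limit=2048,  # assumed hardware thread limit
                         block_size=256):       # assumed threads per thread block
    """Estimate how many threads can be resident on one multiprocessor.

    Whole thread blocks are scheduled, so the bound is computed in blocks:
    the register file caps the block count, and so does the hardware limit.
    """
    if regs_per_thread <= 0 or block_size <= 0:
        raise ValueError("regs_per_thread and block_size must be positive")
    regs_per_block = regs_per_thread * block_size
    blocks_by_regs = regfile_size // regs_per_block
    blocks_by_hw = hw_thread_limit // block_size
    return min(blocks_by_regs, blocks_by_hw) * block_size


if __name__ == "__main__":
    # More registers per thread can help a single thread but reduces TLP.
    for regs in (16, 32, 64, 128):
        print(regs, max_resident_threads(regs))
```

Under these assumed parameters, register usage does not limit TLP up to 32 registers per thread, and each doubling beyond that halves the number of resident threads. Conversely, when throttling lowers the thread count to reduce cache contention, part of the register file is left idle, which is the under-utilization the abstract points to.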

Bibliographic record

Source
Fortschritte der Physik | 2018, Issue 6 | 8 pages
Authors
Xie Xiaolong; Liang Yun; Li Xiuhong; Wu Yudong; Sun Guangyu; Wang Tao; Fan Dongrui
Author affiliations

Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Peking Univ, Sch EECS, Ctr Energy Efficient Comp & Applicat, Beijing 100080, Peoples R China;
Chinese Acad Sci, Inst Comp Technol, Beijing 100864, Peoples R China;

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Physics
Keywords
GPGPU; memory hierarchy; compilers
Date added: 2022-08-20 03:50:02

