
Vertical Memory Optimization for High Performance Energy-efficient GPU

Abstract

GPUs rely heavily on massive multi-threading to achieve high throughput. Massive multi-threading imposes tremendous pressure on different storage components. This dissertation focuses on optimizing the memory subsystem, including the register file, the L1 data cache, and device memory, all of which are shaped by massive multi-threading and dominate the efficiency and scalability of the GPU.

A GPU needs a large register file to support fast thread switching. This dissertation first introduces a power-efficient GPU register file built on the newly emerged racetrack memory (RM). However, the shift operations of RM result in extra power and timing overhead. A holistic set of architecture-level techniques is developed to overcome these adverse impacts and preserve RM's energy advantage. Experimental results show that the proposed techniques keep GPU performance stable compared to the baseline with an SRAM-based register file (RF), while register file energy is reduced by 48.5%.

This work then proposes a versatile warp scheduler (VWS) to reduce L1 data cache misses in GPUs. VWS retains intra-warp cache locality with a simple yet effective per-warp working set estimator, and enhances intra- and inter-thread-block cache locality using a thread-block-aware scheduler. VWS achieves on average 38.4% and 9.3% IPC improvement compared to a widely used warp scheduler and a state-of-the-art warp scheduler, respectively.

Finally, this work targets the off-chip DRAM-based device memory. An integrated architecture substrate is introduced to improve GPU performance and energy efficiency through efficient bandwidth utilization. The first part of the substrate, thread batch enabled memory partitioning (TEMP), improves memory access parallelism. TEMP introduces thread batching to separate the memory access streams from SMs. The second part, the thread batch-aware scheduler (TBAS), is designed to improve memory access locality. Experimental results show that TEMP and TBAS together obtain up to 10.3% performance improvement and 11.3% DRAM energy reduction for GPU workloads.

Bibliographic details

  • Author

    Mao Mengjie

  • Year 2016
  • Original format PDF
  • Language en
