Conference: IEEE International Symposium on High Performance Computer Architecture

Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding

Abstract

This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs use a number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations, such as off-chip memory accesses, often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching capability. It is unclear whether adding more threads can improve latency tolerance, given the increased memory contention. Further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values are written to renamed physical registers during P-mode. We exploit register file underutilization to repurpose a few unused registers to store the P-mode results. When a warp switches from P-mode to normal execution mode, it reuses pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load, which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
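The P-mode scan described in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's hardware design: the names `Instr` and `p_mode_scan`, the register names, and the `ld.global` mnemonic are hypothetical. It also assumes that a pre-load only warms the L1 cache (so its destination register stays unavailable) and that the scan stops once the spare registers run out; the abstract does not specify either detail.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this sketch."""
    op: str                # mnemonic, e.g. "add" or "ld.global"
    dst: str               # destination architectural register
    srcs: Tuple[str, ...]  # source architectural registers


def p_mode_scan(
    stalled_dst: str,
    window: Sequence[Instr],
    spare_regs: Sequence[str],
) -> List[Tuple[str, str, Tuple[str, ...], str]]:
    """Classify the instructions that follow a stalled long-latency load.

    Returns (action, op, renamed_sources, physical_destination) tuples.
    """
    pending = {stalled_dst}      # registers on the long-latency dependence chain
    rename: Dict[str, str] = {}  # architectural register -> spare physical register
    spare = list(spare_regs)     # unused registers repurposed for P-mode results
    plan = []

    for ins in window:
        if any(src in pending for src in ins.srcs):
            # Dependent on the stalled load: skip it and mark its result pending.
            pending.add(ins.dst)
            rename.pop(ins.dst, None)
            plan.append(("skip", ins.op, ins.srcs, ""))
            continue

        # Independent: read any earlier P-mode results through the rename map.
        srcs = tuple(rename.get(src, src) for src in ins.srcs)

        if ins.op == "ld.global":
            # Assumption: a pre-load only fetches into L1, so dst stays pending.
            pending.add(ins.dst)
            rename.pop(ins.dst, None)
            plan.append(("preload", ins.op, srcs, ""))
            continue

        if not spare:
            break  # assumption: no spare register left, so stop scanning

        # Writing to a fresh physical register avoids WAW and WAR hazards.
        phys = spare.pop(0)
        rename[ins.dst] = phys
        pending.discard(ins.dst)
        plan.append(("pre-execute", ins.op, srcs, phys))

    return plan


# Example: the warp is stalled on a load whose destination is r1.
window = [
    Instr("add", "r2", ("r1", "r0")),   # depends on the stalled load
    Instr("mul", "r3", ("r4", "r5")),   # independent
    Instr("add", "r6", ("r3", "r4")),   # consumes a pre-executed result
    Instr("ld.global", "r7", ("r6",)),  # becomes a pre-load
    Instr("add", "r8", ("r7", "r2")),   # depends on the pre-load
    Instr("mul", "r1", ("r4", "r4")),   # overwrites r1 (write-after-write)
    Instr("add", "r9", ("r1", "r3")),   # reads the renamed r1
]
plan = p_mode_scan("r1", window, ["p0", "p1", "p2", "p3"])
for step in plan:
    print(step)
```

In this model, the rename map is what lets an independent instruction overwrite `r1` while the original load is still outstanding: the new value lands in a spare register, and later readers are redirected there.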
