Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding

Abstract

This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs use a large number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations, such as off-chip memory accesses, often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread switching. It is unclear whether adding more threads can improve latency tolerance, because more threads also increase memory contention. Further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation, it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values produced during P-mode are written to renamed physical registers. We exploit register file underutilization to repurpose a few unused registers to store the P-mode results. When a warp is switched from P-mode to normal execution mode, it reuses pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load, which fetches data into the L1 cache to reduce future memory access penalties. Our evaluation results show a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
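
The instruction selection described in the abstract can be illustrated with a small software model. The Python sketch below scans the instructions that follow a stalled load, skips anything on the long-latency dependence chain, assigns a spare physical register to each independent instruction, and turns independent global loads into pre-loads. It is a toy model under stated assumptions, not the hardware design from the paper: the `Instr` tuple, the `pre_execute` function, the opcode strings, and the register names are all hypothetical. Two further assumptions are that a pre-load's destination is treated as unavailable during P-mode and that pre-execution stops once the spare registers run out.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                # illustrative opcode, e.g. "ld.global", "add", "mul"
    dst: str               # architectural destination register
    srcs: Tuple[str, ...]  # architectural source registers


def pre_execute(stalled_dst, window, free_regs):
    """Toy P-mode model for a single warp (hypothetical, for illustration).

    stalled_dst: destination register of the load the warp is stalled on.
    window:      instructions after the stalled load, in program order.
    free_regs:   unused physical registers available for P-mode results.
    Returns a list of (action, instr, phys_reg_or_None) tuples.
    """
    unavailable = {stalled_dst}  # registers on the long-latency dependence chain
    free = list(free_regs)
    trace = []
    for ins in window:
        if any(src in unavailable for src in ins.srcs):
            # Dependent on the stalled load: its result is unavailable too.
            unavailable.add(ins.dst)
            trace.append(("skip", ins, None))
        elif ins.op == "ld.global":
            # Independent global load: only warm the L1 cache. Assumption:
            # the loaded value itself is not usable during P-mode.
            unavailable.add(ins.dst)
            trace.append(("preload", ins, None))
        elif free:
            # Independent instruction: write the result to a renamed
            # register, which avoids WAW and WAR hazards.
            unavailable.discard(ins.dst)
            trace.append(("pre-execute", ins, free.pop(0)))
        else:
            break  # assumption: stop when no spare registers remain
    return trace


# Example: the warp is stalled on a load into r1.
window = [
    Instr("add", "r2", ("r1", "r0")),   # reads r1: on the dependence chain
    Instr("mul", "r3", ("r4", "r5")),   # independent
    Instr("add", "r6", ("r3", "r2")),   # reads r2: on the dependence chain
    Instr("ld.global", "r7", ("r3",)),  # independent global load
    Instr("add", "r8", ("r7", "r4")),   # reads the pre-load's destination
]
trace = pre_execute("r1", window, ["p0", "p1"])
for action, ins, phys in trace:
    print(action, ins.op, ins.dst, phys)
```

In this model, only the `mul` is pre-executed and receives a renamed register; on return to normal execution, a real design would read that register instead of recomputing the result.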

Bibliographic details

Source
IEEE International Symposium on High Performance Computer Architecture | 2016 | xxi, 723 p. | 13 pages
Authors
Keunsoo Kim; Sangpil Lee; Myung Kuk Yoon; Gunjae Koo; Won Woo Ro; Murali Annavaram
Original format: PDF
Chinese Library Classification: TP302-532
Keywords
hundreds; cycles; hence leads
