Thread Batching for High-performance Energy-efficient GPU Memory Design


Abstract

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because the thread-level parallelism of GPUs has grown rapidly while peak memory bandwidth has improved only slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme that optimizes memory accesses and thereby boosts the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches the thread batch to an SM to ensure highly parallel memory-access streaming from the different thread blocks. Second, a thread batch-aware scheduling (TBAS) scheme is introduced to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads when they run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
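The grouping and page coloring steps described in the abstract can be illustrated with a toy model. The following Python sketch is not the authors' implementation: the page size, bank count, SM count, the round-robin dispatch, and the function names `pages_touched`, `form_batches`, and `assign_batches` are all assumptions made for illustration. It only shows the general idea of batching thread blocks that touch the same set of pages and restricting each batch's pages to the banks reserved for one SM.

```python
from collections import defaultdict

# All constants below are illustrative assumptions, not values from the paper.
PAGE_SIZE = 4096                    # bytes per page
NUM_BANKS = 8                       # DRAM banks in the toy system
NUM_SMS = 4                         # streaming multiprocessors
BANKS_PER_SM = NUM_BANKS // NUM_SMS


def pages_touched(addresses):
    """Return the set of page numbers covered by a thread block's addresses."""
    return frozenset(addr // PAGE_SIZE for addr in addresses)


def form_batches(block_addresses):
    """Group thread blocks whose accesses cover the same set of pages."""
    batches = defaultdict(list)
    for block_id, addresses in sorted(block_addresses.items()):
        batches[pages_touched(addresses)].append(block_id)
    return dict(batches)


def assign_batches(batches):
    """Dispatch batches to SMs round-robin (an assumed policy) and color each
    batch's pages so they map only to the banks reserved for that SM."""
    placement = {}
    for i, (pages, blocks) in enumerate(batches.items()):
        sm = i % NUM_SMS
        banks = range(sm * BANKS_PER_SM, (sm + 1) * BANKS_PER_SM)
        page_to_bank = {
            page: banks[j % BANKS_PER_SM]
            for j, page in enumerate(sorted(pages))
        }
        placement[tuple(blocks)] = (sm, page_to_bank)
    return placement


# Hypothetical access traces: thread block id -> byte addresses it touches.
example = {0: [0, 64, 4096], 1: [128, 4100], 2: [8192, 8200], 3: [8300]}
placement = assign_batches(form_batches(example))
for blocks, (sm, page_to_bank) in placement.items():
    print(f"blocks {blocks} -> SM {sm}, page-to-bank map {page_to_bank}")
```

In this toy trace, blocks 0 and 1 both touch pages 0 and 1 and so form one batch, while blocks 2 and 3 share page 2 and form another. Because each batch's pages are spread over banks that no other SM uses, accesses from different SMs do not collide on a bank in this model.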


Bibliographic information

Source: ACM Journal on Emerging Technologies in Computing Systems, 2019, Issue 4, 21 pages
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: GPU; memory partitioning; thread batch; warp scheduler

Similar literature

1. Thread Batching for High-performance Energy-efficient GPU Memory Design [J]. ACM Journal on Emerging Technologies in Computing Systems. 2019, Issue 4
2. Harmonia: Balancing Compute and Memory Power in High-Performance GPUs [J]. Indrani Paul, Wei Huang, Manish Arora. Computer Architecture News. 2015, Issue 3
3. A high-performance and energy-efficient exhaustive key search approach via GPU on DES-like cryptosystems [J]. Ahmadzadeh Armin, Hajihassani Omid, Gorgin Saeid. Journal of Supercomputing. 2018, Issue 1
4. TEMP: Thread batch enabled memory partitioning for GPU [C]. Mengjie Mao, Wujie Wen, Xiaoxiao Liu. ACM/EDAC/IEEE Design Automation Conference. 2016
5. Content-Aware Memory Systems for High-Performance, Energy-Efficient Data Movement [D]. Wang, Shibo. 2017
6. Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs [O]. Nathan D. Ellingwood, Youbing Yin, Matthew Smith
7. Thread Batching for High-performance Energy-efficient GPU Memory Design [O]. Bing Li, Mengjie Mao, Xiaoxiao Liu. 2019
