Journal of Low Power Electronics

Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units



Abstract

General Purpose Graphics Processing Units (GPGPUs) are finding applications in high performance computing domains owing to their massively parallel architecture. However, executing such applications requires huge amounts of data, so the memory sub-systems of GPGPUs must serve massive amounts of data to the processing cores without long access delays. For this reason, the architecture of GPGPUs has evolved to include low-latency memory units such as caches and shared memory. The popularity of GPGPUs in high performance applications has pushed manufacturers to keep increasing the number of cores with every generation, and a larger number of cores further increases the amount of data that must be serviced by the underlying memory units. To cope with this demand, cache sizes have grown in newer generations of GPGPUs. However, larger caches exacerbate the problem of power dissipation, which is already a major design constraint in processors. Our work proposes two optimization techniques to reduce power consumption in the L1 caches (data, texture, constant, and instruction), shared memory, and the L2 cache; the two techniques target static and dynamic power, respectively. Analysis of the cache access patterns of several GPGPU applications reveals that consecutive accesses to the same cache block are separated in time by hundreds of clock cycles. This long inter-access interval presents a unique opportunity to reduce static power by putting cache cells into drowsy mode. The leakage savings of drowsy mode come at the cost of increased access time, since the voltage of a drowsy cache cell must be raised before the cell can be accessed. Our novel technique of coarse-grained drowsy mode helps to mitigate this impact on performance: we partition each cache into regions of contiguous cache blocks and, upon a cache access, wake up the whole cache region being accessed.
This method exploits the temporal and spatial locality of cache accesses: the wake-up delay is incurred only on the first access to a cache region, and subsequent accesses to the same region incur no delay, which reduces the performance impact of the wake-up latency. Our second optimization technique takes advantage of branch divergence in GPGPUs. GPGPUs use a Single Instruction Multiple Thread (SIMT) execution model, which can cause threads to diverge when a control instruction is encountered. GPGPUs execute branch instructions in two phases: threads on the taken path are active during the first phase while the remaining threads are idle, and threads on the not-taken path execute during the second phase while the rest remain idle. Contemporary GPGPUs access all portions of a cache block even when some threads are idle due to branch divergence. Our optimization technique instead accesses only the portions of a cache block that correspond to active threads; disabling access to the unnecessary sections of cache blocks reduces dynamic power. Our results show a significant reduction in the static and dynamic power of caches when the two optimization techniques are applied together.
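The coarse-grained drowsy scheme described above can be illustrated with a minimal timing sketch. The region size, hit latency, and wake-up delay below are illustrative assumptions, not values from the paper; the point is only that the wake-up penalty is paid once per region, after which locality makes subsequent accesses full speed.

```python
# Sketch of coarse-grained drowsy mode: cache blocks are grouped into
# regions, and an access to a drowsy region pays a one-time wake-up
# penalty that raises the region's voltage; later accesses to any block
# in that region run at normal hit latency.

REGION_SIZE = 4      # blocks per drowsy region (assumed)
HIT_LATENCY = 1      # cycles for a normal cache hit (assumed)
WAKEUP_DELAY = 2     # extra cycles to wake a drowsy region (assumed)

class DrowsyCache:
    def __init__(self, num_blocks):
        assert num_blocks % REGION_SIZE == 0
        self.drowsy = [True] * (num_blocks // REGION_SIZE)  # all regions start drowsy

    def access(self, block_index):
        """Return the latency (in cycles) of accessing one cache block."""
        region = block_index // REGION_SIZE
        if self.drowsy[region]:
            self.drowsy[region] = False          # wake the whole region at once
            return HIT_LATENCY + WAKEUP_DELAY    # only the first access pays the penalty
        return HIT_LATENCY                       # later accesses in the region are full speed

cache = DrowsyCache(num_blocks=16)
print(cache.access(0))  # first access to region 0: 3 cycles (hit + wake-up)
print(cache.access(1))  # same region, already awake: 1 cycle
```

With spatially and temporally local access streams, most accesses land in an already-awake region, so the average latency stays close to the normal hit latency while idle regions remain drowsy and leak less.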
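The divergence-based dynamic-power technique can likewise be sketched as a mapping from a warp's active-thread mask to the block sectors that need to be enabled. The warp size is the standard 32; the 4-bytes-per-lane and 32-byte-sector granularities are illustrative assumptions, not parameters from the paper.

```python
# Sketch of divergence-aware cache-block access: only the sectors of a
# cache block that correspond to active threads are enabled, so lanes
# idled by branch divergence draw no dynamic read power.

WARP_SIZE = 32        # threads per warp
BYTES_PER_THREAD = 4  # each lane reads one 4-byte word (assumed)
SECTOR_BYTES = 32     # enable granularity within a 128-byte block (assumed)

def enabled_sectors(active_mask):
    """Map a 32-bit active-thread mask to the set of block sectors to enable."""
    lanes_per_sector = SECTOR_BYTES // BYTES_PER_THREAD  # 8 lanes per sector
    sectors = set()
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1:
            sectors.add(lane // lanes_per_sector)
    return sectors

# A fully active warp touches all four sectors of the block.
print(sorted(enabled_sectors(0xFFFFFFFF)))  # [0, 1, 2, 3]
# After divergence, only the lower half of the warp is active,
# so half the block's sectors can stay disabled.
print(sorted(enabled_sectors(0x0000FFFF)))  # [0, 1]
```

During the two-phase execution of a branch, the taken-path and not-taken-path masks are complementary, so each phase enables only its own sectors and the dynamic energy of a block access scales with the number of active lanes rather than with the full block width.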
