Journal of Low Power Electronics

Reducing Power of Memory Hierarchy in General Purpose Graphics Processing Units



Abstract

General Purpose Graphics Processing Units (GPGPUs) are finding applications in high performance computing domains owing to their massively parallel architecture. However, executing such applications requires huge amounts of data, so the memory sub-systems of GPGPUs must serve massive amounts of data to the processing cores without long access delays. For this reason, the architecture of GPGPUs has evolved to include low-latency memory units such as caches and shared memory. The popularity of GPGPUs in high performance applications has pushed manufacturers to keep increasing the number of cores with every generation, and a larger number of cores further increases the amount of data that must be serviced by the underlying memory units. To cope with this demand, cache sizes have grown in newer generations of GPGPUs. However, larger caches exacerbate the problem of power dissipation, which is already a major design constraint in processors. Our work proposes two optimization techniques to reduce power consumption in the L1 caches (data, texture, constant, and instruction), shared memory, and the L2 cache; the two techniques target static and dynamic power, respectively. Analysis of the cache access patterns of several GPGPU applications reveals that consecutive accesses to the same cache block are separated in time by hundreds of clock cycles. This long inter-access interval presents a unique opportunity to reduce static power by putting cache cells into drowsy mode. The leakage savings of drowsy mode come at the cost of increased access time, since the voltage of a drowsy cache cell must be raised before the cell can be accessed. Our novel technique of coarse-grained drowsy mode helps to mitigate this impact on performance: we partition each cache into regions of contiguous cache blocks and, upon a cache access, wake up the whole cache region being accessed.
This method exploits the temporal and spatial locality of cache accesses: the wake-up delay is incurred only on the first access to a cache region, and subsequent accesses to the same region incur no delay, which reduces the performance impact of the wake-up latency. Our second optimization technique takes advantage of branch divergence in GPGPUs. GPGPUs use a Single Instruction Multiple Thread (SIMT) execution model, which can cause threads to diverge when a control instruction is encountered. GPGPUs execute branch instructions in two phases: threads on the taken path are active during the first phase while the remaining threads are idle, and threads on the not-taken path execute during the second phase while the rest remain idle. Contemporary GPGPUs access all portions of a cache block even when some threads are idle due to branch divergence. Our optimization technique instead accesses only the portions of a cache block that correspond to active threads; disabling access to the unnecessary sections of cache blocks reduces dynamic power. Our results show a significant reduction in the static and dynamic power of caches when the two optimization techniques are applied together.
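The coarse-grained drowsy scheme described above can be illustrated with a minimal timing sketch. The region size, hit latency, and wake-up delay below are illustrative assumptions, not values from the paper; the point is only that the wake-up penalty is paid once per region, after which locality makes subsequent accesses full speed.

```python
# Sketch of coarse-grained drowsy mode: cache blocks are grouped into
# regions, and an access to a drowsy region pays a one-time wake-up
# penalty that raises the region's voltage; later accesses to any block
# in that region run at normal hit latency.

REGION_SIZE = 4      # blocks per drowsy region (assumed)
HIT_LATENCY = 1      # cycles for a normal cache hit (assumed)
WAKEUP_DELAY = 2     # extra cycles to wake a drowsy region (assumed)

class DrowsyCache:
    def __init__(self, num_blocks):
        assert num_blocks % REGION_SIZE == 0
        self.drowsy = [True] * (num_blocks // REGION_SIZE)  # all regions start drowsy

    def access(self, block_index):
        """Return the latency (in cycles) of accessing one cache block."""
        region = block_index // REGION_SIZE
        if self.drowsy[region]:
            self.drowsy[region] = False          # wake the whole region at once
            return HIT_LATENCY + WAKEUP_DELAY    # only the first access pays the penalty
        return HIT_LATENCY                       # later accesses in the region are full speed

cache = DrowsyCache(num_blocks=16)
print(cache.access(0))  # first access to region 0: 3 cycles (hit + wake-up)
print(cache.access(1))  # same region, already awake: 1 cycle
```

With spatially and temporally local access streams, most accesses land in an already-awake region, so the average latency stays close to the normal hit latency while idle regions remain drowsy and leak less.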
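The divergence-based dynamic-power technique can likewise be sketched as a mapping from a warp's active-thread mask to the block sectors that need to be enabled. The warp size is the standard 32; the 4-bytes-per-lane and 32-byte-sector granularities are illustrative assumptions, not parameters from the paper.

```python
# Sketch of divergence-aware cache-block access: only the sectors of a
# cache block that correspond to active threads are enabled, so lanes
# idled by branch divergence draw no dynamic read power.

WARP_SIZE = 32        # threads per warp
BYTES_PER_THREAD = 4  # each lane reads one 4-byte word (assumed)
SECTOR_BYTES = 32     # enable granularity within a 128-byte block (assumed)

def enabled_sectors(active_mask):
    """Map a 32-bit active-thread mask to the set of block sectors to enable."""
    lanes_per_sector = SECTOR_BYTES // BYTES_PER_THREAD  # 8 lanes per sector
    sectors = set()
    for lane in range(WARP_SIZE):
        if (active_mask >> lane) & 1:
            sectors.add(lane // lanes_per_sector)
    return sectors

# A fully active warp touches all four sectors of the block.
print(sorted(enabled_sectors(0xFFFFFFFF)))  # [0, 1, 2, 3]
# After divergence, only the lower half of the warp is active,
# so half the block's sectors can stay disabled.
print(sorted(enabled_sectors(0x0000FFFF)))  # [0, 1]
```

During the two-phase execution of a branch, the taken-path and not-taken-path masks are complementary, so each phase enables only its own sectors and the dynamic energy of a block access scales with the number of active lanes rather than with the full block width.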
