IEEE Transactions on Computers

Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance


Abstract

To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size increases considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings only marginal performance gains in most applications. In other words, increasing the LLC size scales neither in performance nor in energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., those selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of the blocks being replaced is extended, ii) the main memory path is unclogged during long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much lower area requirements. Experimental results show that the proposed FRC scales in performance with the number of GPU compute units and the LLC size: depending on the FRC size, performance improves by 30 to 67 percent for a modern baseline GPU card, and by 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average by 49 to 57 percent for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
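The swap-based miss handling described above can be summarized with a short behavioral sketch. The sketch below is a simplified, assumption-laden illustration (a single LLC set, a synchronous fetch, and an FRC that always has free entries); names such as FRCEntry, LlcSet, and handle_llc_miss are illustrative and do not come from the paper.

```python
# Minimal behavioral sketch (not the authors' implementation) of the
# Fetch and Replacement Cache (FRC) idea described in the abstract.

from collections import OrderedDict

class Block:
    def __init__(self, tag, data=None, dirty=False):
        self.tag = tag
        self.data = data
        self.dirty = dirty

class FRCEntry:
    """Holds control/coherence state for a block while it is in flight
    from main memory, and later the victim block pending eviction."""
    def __init__(self, tag):
        self.tag = tag
        self.fetched_block = None   # filled when memory responds
        self.victim_block = None    # filled at swap time

class LlcSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> Block, kept in LRU order

    def lookup(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # hit: update LRU position
            return self.blocks[tag]
        return None

    def swap_in(self, new_block):
        """Insert the fetched block; return the evicted LRU victim (if any)."""
        victim = None
        if len(self.blocks) >= self.ways:
            _, victim = self.blocks.popitem(last=False)
        self.blocks[new_block.tag] = new_block
        return victim

def handle_llc_miss(llc_set, frc, tag, fetch_from_memory, write_back):
    """On an LLC miss, track the in-flight block in the FRC instead of
    reserving an LLC way, so the victim keeps servicing hits until the
    fill returns."""
    entry = FRCEntry(tag)
    frc.append(entry)                      # assumes a free FRC entry exists
    entry.fetched_block = fetch_from_memory(tag)
    # Swap: the fetched block goes into the LLC, the victim moves to the FRC.
    entry.victim_block = llc_set.swap_in(entry.fetched_block)
    # Eviction (write-back) is performed from the FRC, off the fill path.
    if entry.victim_block is not None and entry.victim_block.dirty:
        write_back(entry.victim_block)
    frc.remove(entry)
    return entry.fetched_block
```

The point the abstract makes is visible in swap_in and handle_llc_miss: the victim block stays resident (and hittable) in the LLC set until the fill returns, and its write-back is issued from the FRC rather than from the fill path.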