IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications


Abstract

Modern graphics processing units (GPUs) deliver tremendous computing horsepower by running tens of thousands of threads concurrently. This massively parallel execution model has been effective in hiding the long latency of off-chip memory accesses in graphics and other general computing applications that exhibit regular memory behavior. With the fast-growing demand for general-purpose computing on GPUs (GPGPU), GPU workloads are becoming highly diversified and thus require synergistic coordination of both computing and memory resources to unleash the computing power of GPUs. Accordingly, recent graphics processors have begun to integrate an on-die level-2 (L2) cache. The huge number of threads on GPUs, however, poses significant challenges to L2 cache design. Experiments on a variety of GPGPU applications reveal that the L2 cache may or may not improve overall performance, depending on the characteristics of the application. In this paper, we propose efficient techniques to improve GPGPU performance by orchestrating both the L2 cache and memory in a unified framework. The basic philosophy is to exploit the temporal locality among the massive number of concurrent memory requests and to minimize the impact of memory divergence among simultaneously executed groups of threads. Our contributions are twofold. First, a priority-based cache management scheme is proposed to maximize the chance that frequently revisited data are kept in the cache. Second, an effective memory scheduling scheme is introduced that reorders memory requests in the memory controller according to their divergence behavior, reducing the average waiting time of warps. Simulation results reveal that our techniques enhance overall performance by 10% on average for memory-intensive benchmarks, with gains of up to 30%.
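
The abstract does not detail the exact priority policy, but the idea of keeping frequently revisited data resident can be illustrated with a small sketch. The following Python fragment is a hypothetical, simplified model of one cache set: each line carries a saturating priority counter, hits promote a line, new lines enter at low priority, and the victim is the lowest-priority line, so streaming data ages out while hot data stays. The class name PriorityCache and all parameters are illustrative, not the paper's implementation.

    class PriorityCache:
        """One set of a priority-based cache (illustrative sketch only).
        Hits promote a line's priority; misses insert at low priority and
        evict the lowest-priority line, biasing the set toward keeping
        frequently revisited data."""

        def __init__(self, num_lines=8, max_priority=3):
            self.num_lines = num_lines
            self.max_priority = max_priority
            self.lines = {}  # tag -> priority counter

        def access(self, tag):
            if tag in self.lines:
                # Hit: promote, so frequently revisited data tends to stay.
                self.lines[tag] = min(self.lines[tag] + 1, self.max_priority)
                return True
            if len(self.lines) >= self.num_lines:
                # Miss with a full set: evict the lowest-priority line.
                victim = min(self.lines, key=self.lines.get)
                del self.lines[victim]
            # New lines enter at priority 0, so one-touch data ages out quickly.
            self.lines[tag] = 0
            return False

For example, replaying the tag stream 1, 2, 1, 3, 1, 4, 5 against a four-line set leaves tag 1 resident, because its repeated hits raise its priority above that of the streaming tags.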
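
The divergence-aware scheduling idea can be sketched in a similarly simplified way. A warp resumes only after all of its outstanding memory requests complete, so serving warps with fewer pending requests first, while keeping each warp's requests together, lowers the average warp waiting time (shortest-job-first at warp granularity). This is an assumed reading of the abstract's description; schedule_requests and the (warp_id, address) request format are hypothetical, and the sketch ignores row-buffer locality and other real DRAM controller constraints.

    from collections import defaultdict

    def schedule_requests(pending):
        """Reorder (warp_id, address) requests so that low-divergence warps
        (those with few outstanding requests) are served first and each
        warp's requests stay together. Illustrative sketch only."""
        by_warp = defaultdict(list)
        for warp_id, addr in pending:
            by_warp[warp_id].append(addr)
        order = []
        # Warps with the fewest requests finish soonest; schedule them first.
        for warp_id in sorted(by_warp, key=lambda w: len(by_warp[w])):
            order.extend((warp_id, addr) for addr in by_warp[warp_id])
        return order

With pending = [(0, 0x100), (1, 0x200), (0, 0x140), (0, 0x180), (2, 0x300)], warps 1 and 2 are each satisfied after a single request before warp 0's three requests are issued, so two of the three warps stop waiting early and the average waiting time drops.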
