International Symposium on Microarchitecture

Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems



Abstract

Historically, improvement in GPU performance has been tightly coupled with transistor scaling. As Moore's Law slows down, the performance of single GPUs may ultimately plateau. To continue GPU performance scaling, multiple GPUs can be connected using system-level interconnects. However, limited inter-GPU interconnect bandwidth (e.g., 64 GB/s) can hurt multi-GPU performance when there are frequent remote GPU memory accesses. Traditional GPUs rely on page migration to service these accesses from local memory instead. Page migration fails when a page is simultaneously shared between multiple GPUs in the system. As such, recent proposals enhance the software runtime system to replicate read-only shared pages in local memory. Unfortunately, this practice fails when there are frequent remote memory accesses to read-write shared pages. To address this problem, recent proposals cache remote shared data in the GPU last-level cache (LLC). Unfortunately, remote data caching also fails when the shared-data working set exceeds the available GPU LLC size. This paper conducts a combined performance analysis of state-of-the-art software and hardware mechanisms to improve NUMA performance of multi-GPU systems. Our evaluations on a 4-node multi-GPU system reveal that the combination of work scheduling, page placement, page migration, page replication, and caching remote data still incurs a 47% slowdown relative to an ideal NUMA-GPU system. This is because the shared memory footprint tends to be significantly larger than the GPU LLC and cannot be replicated by software, since the shared footprint is read-write. Thus, we show that existing NUMA-aware software solutions require hardware support to address the NUMA bandwidth bottleneck. We propose Caching Remote Data in Video Memory (CARVE), a hardware mechanism that stores recently accessed remote shared data in a dedicated region of GPU memory. CARVE outperforms state-of-the-art NUMA mechanisms and is within 6% of the performance of an ideal NUMA-GPU system. We also investigate the design space for supporting cache coherence. Overall, we show that dedicating only 3% of GPU memory eliminates NUMA bandwidth bottlenecks while incurring negligible performance overhead from the reduced GPU memory capacity.
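To make the mechanism concrete, the following C++ sketch models the caching behavior the abstract describes: a small, dedicated region of local GPU memory holds recently accessed remote shared data, so repeated remote accesses are served locally instead of over the slow inter-GPU link. The direct-mapped organization, 128-byte line size, 16 GiB capacity, and all identifiers (CarveCache, kCarveBytes, and so on) are illustrative assumptions rather than details from the paper, which also covers replacement and cache-coherence choices this sketch omits.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the CARVE idea: a small slice of local GPU DRAM caches
// remote shared data at cache-line granularity.

constexpr uint64_t kLineBytes   = 128;                    // assumed cache-line size
constexpr uint64_t kGpuMemBytes = 16ULL << 30;            // assumed 16 GiB of GPU DRAM
constexpr uint64_t kCarveBytes  = kGpuMemBytes * 3 / 100; // ~3% carved out, per the abstract
constexpr uint64_t kNumLines    = kCarveBytes / kLineBytes;

struct CarveLine {
    uint64_t tag   = 0;
    bool     valid = false;
};

class CarveCache {
public:
    CarveCache() : lines_(kNumLines) {}

    // Returns true if the remote address hits in the carved region.
    // On a miss, models fetching the line over the inter-GPU link and
    // installs it so later accesses are served from local DRAM.
    bool access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        uint64_t idx  = line % kNumLines;  // direct-mapped index (assumption)
        uint64_t tag  = line / kNumLines;
        CarveLine &e  = lines_[idx];
        if (e.valid && e.tag == tag) return true;
        e.tag   = tag;                     // miss: fetch remotely, then cache
        e.valid = true;
        return false;
    }

private:
    std::vector<CarveLine> lines_;
};

int main() {
    CarveCache carve;
    uint64_t remote = 0x200000000ULL;      // hypothetical remote physical address
    std::printf("first access hit?  %d\n", carve.access(remote)); // 0: fetched over link
    std::printf("second access hit? %d\n", carve.access(remote)); // 1: served locally
}
```

Because the carved region lives in DRAM rather than on-chip SRAM, it can be sized far beyond the LLC, which is what lets it capture the read-write shared working sets that defeat both software page replication and LLC-based remote caching.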
