International Symposium on Microarchitecture

Combining HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems



Abstract

Historically, improvement in GPU performance has been tightly coupled with transistor scaling. As Moore's Law slows down, the performance of single GPUs may ultimately plateau. To continue GPU performance scaling, multiple GPUs can be connected using system-level interconnects. However, limited inter-GPU interconnect bandwidth (e.g., 64 GB/s) can hurt multi-GPU performance when there are frequent remote GPU memory accesses. Traditional GPUs rely on page migration to service these accesses from local memory instead. Page migration fails when a page is simultaneously shared between multiple GPUs in the system. As such, recent proposals enhance the software runtime system to replicate read-only shared pages in local memory. Unfortunately, this practice fails when there are frequent remote memory accesses to read-write shared pages. To address this problem, recent proposals cache remote shared data in the GPU last-level cache (LLC). Unfortunately, remote data caching also fails when the shared-data working set exceeds the available GPU LLC size. This paper conducts a combined performance analysis of state-of-the-art software and hardware mechanisms to improve NUMA performance of multi-GPU systems. Our evaluations on a 4-node multi-GPU system reveal that the combination of work scheduling, page placement, page migration, page replication, and caching remote data still incurs a 47% slowdown relative to an ideal NUMA-GPU system. This is because the shared memory footprint tends to be significantly larger than the GPU LLC and cannot be replicated by software, since the shared footprint is read-write. Thus, we show that existing NUMA-aware software solutions require hardware support to address the NUMA bandwidth bottleneck. We propose Caching Remote Data in Video Memory (CARVE), a hardware mechanism that stores recently accessed remote shared data in a dedicated region of GPU memory. CARVE outperforms state-of-the-art NUMA mechanisms and is within 6% of the performance of an ideal NUMA-GPU system. We also investigate the design space for supporting cache coherence. Overall, we show that dedicating only 3% of GPU memory eliminates NUMA bandwidth bottlenecks while incurring negligible performance overhead from the reduced GPU memory capacity.
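To make the mechanism concrete, the following C++ sketch models the caching behavior the abstract describes: a small, dedicated region of local GPU memory holds recently accessed remote shared data, so repeated remote accesses are served locally instead of over the slow inter-GPU link. The direct-mapped organization, 128-byte line size, 16 GiB capacity, and all identifiers (CarveCache, kCarveBytes, and so on) are illustrative assumptions rather than details from the paper, which also covers replacement and cache-coherence choices this sketch omits.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the CARVE idea: a small slice of local GPU DRAM caches
// remote shared data at cache-line granularity.

constexpr uint64_t kLineBytes   = 128;                    // assumed cache-line size
constexpr uint64_t kGpuMemBytes = 16ULL << 30;            // assumed 16 GiB of GPU DRAM
constexpr uint64_t kCarveBytes  = kGpuMemBytes * 3 / 100; // ~3% carved out, per the abstract
constexpr uint64_t kNumLines    = kCarveBytes / kLineBytes;

struct CarveLine {
    uint64_t tag   = 0;
    bool     valid = false;
};

class CarveCache {
public:
    CarveCache() : lines_(kNumLines) {}

    // Returns true if the remote address hits in the carved region.
    // On a miss, models fetching the line over the inter-GPU link and
    // installs it so later accesses are served from local DRAM.
    bool access(uint64_t addr) {
        uint64_t line = addr / kLineBytes;
        uint64_t idx  = line % kNumLines;  // direct-mapped index (assumption)
        uint64_t tag  = line / kNumLines;
        CarveLine &e  = lines_[idx];
        if (e.valid && e.tag == tag) return true;
        e.tag   = tag;                     // miss: fetch remotely, then cache
        e.valid = true;
        return false;
    }

private:
    std::vector<CarveLine> lines_;
};

int main() {
    CarveCache carve;
    uint64_t remote = 0x200000000ULL;      // hypothetical remote physical address
    std::printf("first access hit?  %d\n", carve.access(remote)); // 0: fetched over link
    std::printf("second access hit? %d\n", carve.access(remote)); // 1: served locally
}
```

Because the carved region lives in DRAM rather than on-chip SRAM, it can be sized far beyond the LLC, which is what lets it capture the read-write shared working sets that defeat both software page replication and LLC-based remote caching.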
