
Beyond the Socket: NUMA-Aware GPUs


Abstract

GPUs achieve high throughput and power efficiency by employing many small single-instruction, multiple-thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count), they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work, we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.
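
To put the reported figures side by side, the short sketch below derives two quantities from the numbers in the abstract. It rests on an assumption the abstract does not spell out: that "percent of theoretical application scalability" means achieved speedup divided by the best speedup the applications could reach at that socket count. The function names are illustrative only and do not come from the paper.

```python
# Reported figures from the abstract: sockets -> (speedup over one GPU,
# fraction of theoretical application scalability achieved).
REPORTED = {
    2: (1.5, 0.89),
    4: (2.3, 0.84),
    8: (3.2, 0.76),
}


def implied_theoretical_speedup(achieved: float, fraction: float) -> float:
    """Theoretical speedup implied by the figures.

    Assumption (not stated in the abstract): fraction = achieved / theoretical.
    """
    return achieved / fraction


def speedup_per_socket(achieved: float, sockets: int) -> float:
    """Achieved speedup divided by socket count (1.0 would be linear scaling)."""
    return achieved / sockets


if __name__ == "__main__":
    for sockets, (achieved, fraction) in REPORTED.items():
        theoretical = implied_theoretical_speedup(achieved, fraction)
        per_socket = speedup_per_socket(achieved, sockets)
        print(
            f"{sockets} sockets: achieved {achieved:.1f}x, "
            f"implied theoretical {theoretical:.2f}x, "
            f"speedup per socket {per_socket:.3f}"
        )
```

Under that reading, the implied theoretical speedups are well below linear in the socket count, which would suggest that the evaluated applications themselves limit scaling, independent of NUMA effects.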
