MCM-GPU: Multi-Chip-Module GPUs for Continued Performance Scalability

Akhil Arunkumar; Evgeny Bolotin; Benjamin Cho; Ugljesa Milic; Eiman Ebrahimi; Oreste Villa; Aamer Jaleel; Carole-Jean Wu; David Nellans

Abstract

Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this paper we demonstrate that package-level integration of multiple GPU modules to build larger logical GPUs can enable continuous performance scaling beyond Moore's law. Specifically, we propose partitioning GPUs into easily manufacturable basic GPU Modules (GPMs), and integrating them on package using high bandwidth and power efficient signaling technologies. We lay out the details and evaluate the feasibility of a basic Multi-Chip-Module GPU (MCM-GPU) design. We then propose three architectural optimizations that significantly improve GPM data locality and minimize the sensitivity to inter-GPM bandwidth. Our evaluation shows that the optimized MCM-GPU achieves 22.8% speedup and 5x inter-GPM bandwidth reduction when compared to the basic MCM-GPU architecture. Most importantly, the optimized MCM-GPU design is 45.5% faster than the largest implementable monolithic GPU, and performs within 10% of a hypothetical (and unbuildable) monolithic GPU. Lastly, we show that our optimized MCM-GPU is 26.8% faster than an equally equipped Multi-GPU system with the same total number of SMs and DRAM bandwidth.
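
The speedups quoted in the abstract are each stated against a different baseline, which makes them hard to compare at a glance. The following minimal sketch puts them on one scale by normalizing to the optimized MCM-GPU. It assumes that all three percentages are averages over the same workload set and can therefore be inverted directly, which the abstract does not state. The constant and function names are illustrative and do not come from the paper.

```python
# Speedup factors taken from the abstract (optimized MCM-GPU vs. each baseline).
OPT_VS_BASIC_MCM = 1.228      # 22.8% faster than the basic MCM-GPU
OPT_VS_LARGEST_MONO = 1.455   # 45.5% faster than the largest implementable monolithic GPU
OPT_VS_MULTI_GPU = 1.268      # 26.8% faster than the equally equipped multi-GPU system


def relative_performance():
    """Return each design's performance relative to the optimized MCM-GPU (= 1.0).

    Assumption (not stated in the abstract): the three speedups are measured
    over the same workloads, so each baseline is simply 1 / speedup.
    """
    return {
        "optimized MCM-GPU": 1.0,
        "basic MCM-GPU": 1.0 / OPT_VS_BASIC_MCM,
        "multi-GPU system": 1.0 / OPT_VS_MULTI_GPU,
        "largest implementable monolithic GPU": 1.0 / OPT_VS_LARGEST_MONO,
    }


if __name__ == "__main__":
    for design, perf in relative_performance().items():
        print(f"{design:40s} {perf:.3f}")
```

Under that assumption, the basic MCM-GPU lands at roughly 0.81, the multi-GPU system at roughly 0.79, and the largest implementable monolithic GPU at roughly 0.69 of the optimized design's performance. The hypothetical monolithic GPU is left out because the abstract gives only a bound ("within 10%"), not a figure.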

Bibliographic details

Source
Computer Architecture News, 2017, No. 2, pp. 320-332 (13 pages)
Authors
Akhil Arunkumar; Evgeny Bolotin; Benjamin Cho; Ugljesa Milic; Eiman Ebrahimi; Oreste Villa; Aamer Jaleel; Carole-Jean Wu; David Nellans
Affiliations

Arizona State University;

NVIDIA;

University of Texas at Austin;

Barcelona Supercomputing Center / Universitat Politecnica de Catalunya;

NVIDIA;

NVIDIA;

Arizona State University;

NVIDIA;

Original format: PDF
Language: English
Keywords
Graphics Processing Units; Multi-Chip-Modules; NUMA Systems; Moore's Law;

