Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in and out of GPUs remains a major performance bottleneck. With CUDA 4.1, NVIDIA introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to issue MPI calls directly on GPU device memory. This improves programmability by removing the burden of complex data movement optimizations from application developers. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes that take advantage of the IPC capabilities provided by CUDA. We also demonstrate how MPI one-sided communication semantics can deliver better performance and overlap by exploiting IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4 MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and active synchronization shows a 74% improvement in latency for a 4 MByte message, compared to the existing Send/Receive-based implementation. Our benchmark using Get and passive synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of the Lattice Boltzmann Method for multiphase flows, by 16% compared to the existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node using CUDA IPC.
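To make the underlying mechanism concrete, the following is a minimal sketch of the CUDA IPC pattern the paper builds on, not the MVAPICH2-internal design: two MPI ranks on the same node exchange an IPC handle over a regular host-memory message, after which the receiver maps the sender's device buffer and pulls the data with a single device-to-device copy driven by the GPU DMA engine. The program structure, buffer size, and one-GPU-per-rank device assignment are illustrative assumptions.

```c
/* Illustrative sketch only (assumed structure, not the paper's library code):
 * rank 0 exports a CUDA IPC handle for its device buffer; rank 1 opens it and
 * copies device-to-device, avoiding a staging copy through host memory.
 * Error checking is omitted for brevity. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);                 /* one GPU per rank on a multi-GPU node */

    size_t bytes = 4 << 20;              /* 4 MByte payload, as in the reported results */
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    if (rank == 0) {
        /* Export an IPC handle for the device buffer and ship it to rank 1
         * over an ordinary (host-memory) MPI message. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, d_buf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);     /* wait until rank 1 has finished copying */
    } else if (rank == 1) {
        /* Map rank 0's buffer into this process and pull the data with one
         * device-to-device copy. */
        cudaIpcMemHandle_t handle;
        void *d_peer;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(d_buf, d_peer, bytes, cudaMemcpyDeviceToDevice);
        cudaDeviceSynchronize();         /* ensure the DMA copy has completed */
        cudaIpcCloseMemHandle(d_peer);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

The contribution described in the abstract is to hide exactly this handle exchange and DMA-driven copy inside the MPI library, so that applications obtain the benefit through standard MPI_Send/MPI_Recv and one-sided Put/Get calls on device pointers.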