2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication



Abstract

Many modern clusters are being equipped with multiple GPUs per node to achieve better compute density and power efficiency. However, moving data in/out of GPUs continues to remain a major performance bottleneck. With CUDA 4.1, NVIDIA has introduced Inter-Process Communication (IPC) to address data movement overheads between processes using different GPUs connected to the same node. State-of-the-art MPI libraries like MVAPICH2 are being modified to allow application developers to use MPI calls directly over GPU device memory. This improves programmability for application developers by removing the burden of dealing with complex data movement optimizations. In this paper, we propose efficient designs for intra-node MPI communication on multi-GPU nodes, taking advantage of the IPC capabilities provided in CUDA. We also demonstrate how MPI one-sided communication semantics can provide better performance and overlap by taking advantage of IPC and the Direct Memory Access (DMA) engine on a GPU. We demonstrate the effectiveness of our designs using micro-benchmarks and an application. The proposed designs improve GPU-to-GPU MPI Send/Receive latency for 4MByte messages by 79% and achieve 4 times the bandwidth for the same message size. One-sided communication using Put and Active synchronization shows a 74% improvement in latency for 4MByte messages, compared to the existing Send/Receive-based implementation. Our benchmark using Get and Passive synchronization demonstrates that true asynchronous progress can be achieved using IPC and the GPU DMA engine. Our designs for two-sided and one-sided communication improve the performance of GPULBM, a CUDA implementation of the Lattice Boltzmann Method for multiphase flows, by 16%, compared to the performance using existing designs in MVAPICH2. To the best of our knowledge, this is the first paper to provide a comprehensive solution for MPI two-sided and one-sided GPU-to-GPU communication within a node, using CUDA IPC.
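As a concrete illustration of the mechanism the paper builds on, the following minimal sketch (not the authors' MVAPICH2 design) shows two MPI ranks on the same node exchanging a CUDA IPC handle and performing a direct GPU-to-GPU copy via the DMA engine. The buffer size, rank-to-GPU mapping, and buffer names are illustrative assumptions.

/* Minimal CUDA IPC sketch: rank 0 exports a device buffer as an IPC handle,
   rank 1 opens the handle and copies the data GPU-to-GPU, bypassing any
   staging through host memory. Run with two MPI ranks on one node. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NBYTES (4 * 1024 * 1024)    /* 4 MByte message, matching the evaluation */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);            /* assumption: one GPU per rank, same node */

    void *dbuf;
    cudaMalloc(&dbuf, NBYTES);

    if (rank == 0) {
        cudaMemset(dbuf, 1, NBYTES);

        /* Export the device allocation as an IPC handle and ship the
           handle (plain bytes) to the peer process over MPI. */
        cudaIpcMemHandle_t handle;
        cudaIpcGetMemHandle(&handle, dbuf);
        MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        cudaIpcMemHandle_t handle;
        MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        /* Map rank 0's buffer into this process and let the GPU DMA engine
           move the data GPU-to-GPU, with no staging through host memory. */
        void *peer_ptr;
        cudaIpcOpenMemHandle(&peer_ptr, handle, cudaIpcMemLazyEnablePeerAccess);
        cudaMemcpy(dbuf, peer_ptr, NBYTES, cudaMemcpyDefault);
        cudaIpcCloseMemHandle(peer_ptr);
    }

    /* Rank 0 must not free its buffer until rank 1 has closed the handle. */
    MPI_Barrier(MPI_COMM_WORLD);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}

In a GPU-aware MPI library such as MVAPICH2, this handle exchange and mapping would happen inside MPI Send/Receive or one-sided Put/Get calls issued directly on device pointers, which is the communication path the proposed designs optimize.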
