International Conference on High Performance Computing

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

Abstract

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further address this data-movement bottleneck. However, the existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small-message communication due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
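
For orientation only (this is not the authors' implementation), the sketch below shows the communication pattern the paper targets: a small-message GPU-to-GPU ping-pong between two MPI ranks, assuming a CUDA-aware MPI library such as MVAPICH2. The 4 KB message size and the build line are illustrative assumptions; whether such a message travels over an eager or a rendezvous path is decided inside the MPI library, typically based on a configurable eager threshold.

/* Minimal sketch: GPU-to-GPU ping-pong with a CUDA-aware MPI library.
 * Hypothetical build line:  mpicc pingpong.c -o pingpong -lcudart
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 4096;   /* small message, typically below the eager threshold */
    const int iters     = 1000;
    int rank, size;
    char *gpu_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Allocate the communication buffer directly in GPU memory.  A
     * CUDA-aware MPI library accepts this device pointer in MPI_Send /
     * MPI_Recv and performs the data movement itself, e.g. via host
     * staging, pipelining, or GPUDirect RDMA, depending on message size
     * and configuration. */
    cudaMalloc((void **)&gpu_buf, msg_bytes);
    cudaMemset(gpu_buf, rank, msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(gpu_buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(gpu_buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(gpu_buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(gpu_buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

Note that passing device pointers to MPI calls generally requires GPU support to be enabled in the library at run time (for MVAPICH2, e.g. a parameter such as MV2_USE_CUDA=1); the exact parameter names for CUDA and GDR support vary by release, so the library documentation should be consulted.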