International Conference on High Performance Computing

Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters

Abstract

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Earlier, GPU-to-GPU inter-node communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. In addition, the newly introduced GPUDirect RDMA (GDR) is a promising solution to further address this data-movement bottleneck. However, the existing design in MPI libraries applies the rendezvous protocol for all message sizes, which incurs considerable overhead for small-message communication due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs to support the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes using the eager protocol. Our experimental results demonstrate up to 59% and 63% reduction in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
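
For orientation only (this is not the authors' implementation), the sketch below shows the communication pattern the paper targets: a small-message GPU-to-GPU ping-pong between two MPI ranks, assuming a CUDA-aware MPI library such as MVAPICH2. The 4 KB message size and the build line are illustrative assumptions; whether such a message travels over an eager or a rendezvous path is decided inside the MPI library, typically based on a configurable eager threshold.

/* Minimal sketch: GPU-to-GPU ping-pong with a CUDA-aware MPI library.
 * Hypothetical build line:  mpicc pingpong.c -o pingpong -lcudart
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int msg_bytes = 4096;   /* small message, typically below the eager threshold */
    const int iters     = 1000;
    int rank, size;
    char *gpu_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Allocate the communication buffer directly in GPU memory.  A
     * CUDA-aware MPI library accepts this device pointer in MPI_Send /
     * MPI_Recv and performs the data movement itself, e.g. via host
     * staging, pipelining, or GPUDirect RDMA, depending on message size
     * and configuration. */
    cudaMalloc((void **)&gpu_buf, msg_bytes);
    cudaMemset(gpu_buf, rank, msg_bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(gpu_buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(gpu_buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(gpu_buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(gpu_buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    cudaFree(gpu_buf);
    MPI_Finalize();
    return 0;
}

Note that passing device pointers to MPI calls generally requires GPU support to be enabled in the library at run time (for MVAPICH2, e.g. a parameter such as MV2_USE_CUDA=1); the exact parameter names for CUDA and GDR support vary by release, so the library documentation should be consulted.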