IEEE Transactions on Parallel and Distributed Systems

Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast

Abstract

Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed that exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to an 82 percent reduction in latency compared to state-of-the-art solutions in the benchmark-level evaluation. Furthermore, compared to the state-of-the-art, the proposed design yields consistently higher throughput for a synthetic streaming workload and 1.3x faster training for a deep learning framework.
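
The benchmark-level evaluation described in the abstract centers on MPI broadcast latency over GPU-resident buffers. As a point of reference, the sketch below shows a minimal CUDA-aware MPI_Bcast latency loop of the kind such evaluations typically use. The message size, iteration count, and timing scheme are illustrative assumptions, not the authors' actual harness; a CUDA-aware MPI library (e.g., MVAPICH2-GDR, which supports GPUDirect RDMA) is assumed so that device pointers can be passed to MPI directly.

```c
/* Minimal sketch of a CUDA-aware MPI broadcast latency benchmark.
 * Message size, iteration count, and timing are illustrative
 * assumptions, not the paper's evaluation harness. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t msg_size = 1 << 20;   /* 1 MiB message (assumption) */
    const int iters = 100;             /* iterations (assumption) */

    /* With a CUDA-aware MPI, a device pointer can be handed directly
     * to MPI_Bcast; GPUDirect RDMA then lets the InfiniBand HCA read
     * and write GPU memory without staging through host buffers. */
    void *d_buf;
    cudaMalloc(&d_buf, msg_size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Bcast(d_buf, (int)msg_size, MPI_BYTE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg broadcast latency: %.2f us\n",
               (t1 - t0) / iters * 1e6);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Built with mpicc and linked against the CUDA runtime (-lcudart), a loop like this is what distinguishes the schemes compared in the paper: a conventional host-staged broadcast copies the GPU buffer through host memory on every rank, while the proposed IB-MCAST plus GPUDirect RDMA path avoids that staging, which is where the reported latency reduction comes from.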