2016 First Workshop on Optimization of Communication in HPC

Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications



Abstract

Data-intensive streaming applications are widely run on High-Performance Computing (HPC) systems in pursuit of higher performance and scalability. These applications typically use broadcast operations to disseminate real-time data from a single source to multiple workers, each of which is a multi-GPU computing site. State-of-the-art broadcast operations take advantage of InfiniBand (IB) hardware multicast (MCAST) and NVIDIA GPUDirect features to boost inter-node communication performance and scalability. The IB MCAST feature works only with the IB Unreliable Datagram (UD) mechanism and consequently provides only unreliable communication to applications, so higher-level libraries and/or runtime environments must provide reliability explicitly. However, handling reliability at that level can become a performance bottleneck for streaming applications. In this paper, we analyze the specific requirements of streaming applications and the performance bottlenecks involved in handling reliability. We show that the traditional Negative-Acknowledgement (NACK) based approach requires the broadcast sender to retransmit lost packets, which degrades streaming throughput. To alleviate this issue, we propose a novel Remote Memory Access (RMA) based scheme that provides high-performance reliability support at the MPI level. In the proposed scheme, the receivers themselves, rather than the sender, retrieve lost packets through RMA operations. Furthermore, we provide an analytical model to illustrate the memory requirements of the proposed RMA-based scheme. Our experimental results show that the proposed scheme adds almost no overhead relative to existing solutions. In a micro-benchmark with injected failures (to simulate unreliable network environments), the proposed scheme shows up to a 45% reduction in latency compared to the existing NACK-based scheme. Moreover, with a synthetic streaming benchmark, our design shows up to a 56% higher broadcast rate than the traditional NACK-based scheme on a GPU-dense Cray CS-Storm system with up to 88 NVIDIA K80 GPU cards.
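The difference between the two recovery strategies described in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation and uses no MPI or InfiniBand calls; it only counts who performs the recovery work after a lossy broadcast. All function names (`broadcast_with_loss`, `recover_nack`, `recover_rma`, `compare`) and the parameters are hypothetical, and the "one-sided read" is simulated by a plain list lookup into a buffer that is assumed to be exposed by the sender.

```python
import random


def broadcast_with_loss(num_receivers, packets, loss_prob, rng):
    """Simulate an unreliable multicast: each receiver drops packets independently."""
    received = [dict() for _ in range(num_receivers)]
    for seq, payload in enumerate(packets):
        for r in range(num_receivers):
            if rng.random() >= loss_prob:
                received[r][seq] = payload
    return received


def recover_nack(packets, received):
    """NACK-style recovery: the sender retransmits every missing packet."""
    sender_ops = 0
    for buf in received:
        for seq, payload in enumerate(packets):
            if seq not in buf:
                buf[seq] = payload  # stand-in for a sender retransmission
                sender_ops += 1
    return sender_ops


def recover_rma(sender_window, received):
    """RMA-style recovery: each receiver fetches missing packets itself."""
    receiver_reads = 0
    for buf in received:
        for seq in range(len(sender_window)):
            if seq not in buf:
                buf[seq] = sender_window[seq]  # stand-in for a one-sided read
                receiver_reads += 1
    return 0, receiver_reads  # the sender does no recovery work in this model


def compare(num_receivers=8, num_packets=1000, loss_prob=0.01, seed=42):
    """Run both recovery models on the same loss pattern and count operations."""
    packets = [f"pkt-{i}".encode() for i in range(num_packets)]
    received = broadcast_with_loss(num_receivers, packets, loss_prob, random.Random(seed))
    nack_view = [dict(buf) for buf in received]
    rma_view = [dict(buf) for buf in received]
    nack_sender_ops = recover_nack(packets, nack_view)
    rma_sender_ops, rma_receiver_reads = recover_rma(packets, rma_view)
    complete = all(len(buf) == num_packets for buf in nack_view + rma_view)
    return {
        "nack_sender_ops": nack_sender_ops,
        "rma_sender_ops": rma_sender_ops,
        "rma_receiver_reads": rma_receiver_reads,
        "complete": complete,
    }


if __name__ == "__main__":
    print(compare())
```

In this model the total number of recovered packets is identical under both strategies; what changes is where the work lands. With NACK-based recovery every lost packet costs the sender one operation, whereas with receiver-driven reads that cost is spread across the receivers, which is the intuition behind the throughput benefit the abstract reports. The model ignores latency, buffer lifetime, and the memory requirements that the paper's analytical model addresses.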
