Published in: International Symposium on Computer Architecture and High Performance Computing

Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters



Abstract

High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and the high network throughput offered by high-performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, which is stored in host memory, from a single source (typically live) to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of the IB hardware multicast feature to boost their performance, their heterogeneous counterparts require explicit data movement between the host and the GPU, which significantly hurts overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications, especially on emerging multi-GPU configurations. In this work, we propose novel techniques that fully and conjointly exploit NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC), and the IB hardware multicast feature to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the utilization of valuable PCIe resources. Further, we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to a 3X improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to a 67% improvement in the execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.
