IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Abstract

Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, due to their compute requirements, these collectives have been implemented on the CPU (or host) only. However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU- (or device-) based versions. In this paper, we tackle the above challenges and provide designs and implementations for the most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - on GPU clusters. We propose extensions to the state-of-the-art algorithms to fully exploit GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages with MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than a 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs for extremely large-scale GPU clusters.
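To make the programming model concrete, the sketch below shows how a compute-oriented collective is invoked on device-resident buffers through a CUDA-aware MPI library. This is a minimal illustration under stated assumptions (a CUDA-aware MPI build, an illustrative buffer size and fill pattern), not the paper's implementation; the paper's contribution lies in how the library services such a call internally, e.g., moving data via GPUDirect RDMA and applying the reduction operator with a CUDA kernel instead of staging through the host.

```c
/*
 * Minimal sketch: MPI_Reduce on GPU buffers with a CUDA-aware MPI.
 * Assumptions: the MPI library accepts device pointers (CUDA-aware
 * build); buffer size and data values are illustrative only.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                  /* 1 Mi doubles per rank */
    const size_t bytes = count * sizeof(double);

    /* Stage local data on the host, then move it to the device. */
    double *h_buf = (double *)malloc(bytes);
    for (int i = 0; i < count; i++)
        h_buf[i] = (double)rank;

    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, bytes);
    cudaMalloc((void **)&d_recv, bytes);
    cudaMemcpy(d_send, h_buf, bytes, cudaMemcpyHostToDevice);

    /* Device pointers go straight into the collective; a CUDA-aware
     * MPI may perform the element-wise MPI_SUM with a GPU kernel and
     * move data over GDR rather than copying through host memory. */
    MPI_Reduce(d_send, d_recv, count, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        cudaMemcpy(h_buf, d_recv, bytes, cudaMemcpyDeviceToHost);
        printf("reduced[0] = %f\n", h_buf[0]);  /* sum over all ranks */
    }

    free(h_buf);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Passing device pointers directly, rather than copying to host buffers around each collective, is what lets the library choose GDR transfers and kernel-based computation internally; the application code stays identical to the host-only version apart from where the buffers live.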
