
CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing


Abstract

Accelerators such as NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, due to their compute requirements, these collectives have been implemented only on the CPU (or host). However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU (or device) based versions. In this paper, we tackle the above challenges and provide designs and implementations of the most commonly used compute-oriented collectives (MPI_Reduce, MPI_Allreduce, and MPI_Scan) for GPU clusters. We propose extensions to state-of-the-art algorithms that take full advantage of GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show improvements of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report a more than 40% reduction in execution time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs on extremely large-scale GPU clusters.
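To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation) of the two ingredients such designs combine: moving device buffers directly through MPI calls, and performing the reduction arithmetic with a CUDA kernel on the GPU rather than staging data through the host. It assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR or Open MPI built with CUDA support) that accepts device pointers; the flat receive-and-combine loop at the root stands in for the tree-based schedules a real design would use, and the buffer names and sizes are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)  /* elements contributed by each rank */

/* Elementwise combine on the device: the compute half of MPI_Reduce
 * with MPI_SUM, performed by a CUDA kernel instead of the host CPU. */
__global__ void elementwise_sum(const float *in, float *inout, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        inout[i] += in[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *d_local, *d_tmp;
    cudaMalloc(&d_local, N * sizeof(float));
    cudaMalloc(&d_tmp,   N * sizeof(float));
    /* ... fill d_local with this rank's contribution ... */

    if (rank == 0) {
        /* Flat reduce at the root: receive each peer's device buffer
         * directly (over GPUDirect RDMA when available) and fold it
         * into d_local on the GPU. Real designs use tree schedules. */
        for (int src = 1; src < size; src++) {
            MPI_Recv(d_tmp, N, MPI_FLOAT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            int threads = 256, blocks = (N + threads - 1) / threads;
            elementwise_sum<<<blocks, threads>>>(d_tmp, d_local, N);
            /* Finish combining before the next receive reuses d_tmp. */
            cudaDeviceSynchronize();
        }
        /* d_local now holds the reduced vector on the root's GPU. */
    } else {
        MPI_Send(d_local, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    cudaFree(d_local);
    cudaFree(d_tmp);
    MPI_Finalize();
    return 0;
}

With a GPU-aware library, the manual loop above collapses to a single call on device pointers, e.g. MPI_Reduce(d_local, d_result, N, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD). At scale, designs of this kind are commonly analyzed with alpha-beta style cost models, whose generic form for a binomial-tree reduce of an n-byte message over p ranks is roughly ceil(log2 p) * (alpha + n*beta + n*gamma), with gamma capturing the per-byte reduction cost; the paper develops and validates its own analytical models for the proposed designs.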