IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Abstract

Accelerators like NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, due to their compute requirements, these collectives have been implemented on the CPU (or host) only. However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU- (or device-) based versions. In this paper, we tackle the above challenges and provide designs and implementations for the most commonly used compute-oriented collectives - MPI_Reduce, MPI_Allreduce, and MPI_Scan - on GPU clusters. We propose extensions to the state-of-the-art algorithms to fully exploit GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show an improvement of 50% for small messages and 85% for large messages with MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report more than a 40% reduction in time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs for extremely large-scale GPU clusters.
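To make the programming model concrete, the sketch below shows how a compute-oriented collective is invoked on device-resident buffers through a CUDA-aware MPI library. This is a minimal illustration under stated assumptions (a CUDA-aware MPI build, an illustrative buffer size and fill pattern), not the paper's implementation; the paper's contribution lies in how the library services such a call internally, e.g., moving data via GPUDirect RDMA and applying the reduction operator with a CUDA kernel instead of staging through the host.

```c
/*
 * Minimal sketch: MPI_Reduce on GPU buffers with a CUDA-aware MPI.
 * Assumptions: the MPI library accepts device pointers (CUDA-aware
 * build); buffer size and data values are illustrative only.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                  /* 1 Mi doubles per rank */
    const size_t bytes = count * sizeof(double);

    /* Stage local data on the host, then move it to the device. */
    double *h_buf = (double *)malloc(bytes);
    for (int i = 0; i < count; i++)
        h_buf[i] = (double)rank;

    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, bytes);
    cudaMalloc((void **)&d_recv, bytes);
    cudaMemcpy(d_send, h_buf, bytes, cudaMemcpyHostToDevice);

    /* Device pointers go straight into the collective; a CUDA-aware
     * MPI may perform the element-wise MPI_SUM with a GPU kernel and
     * move data over GDR rather than copying through host memory. */
    MPI_Reduce(d_send, d_recv, count, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0) {
        cudaMemcpy(h_buf, d_recv, bytes, cudaMemcpyDeviceToHost);
        printf("reduced[0] = %f\n", h_buf[0]);  /* sum over all ranks */
    }

    free(h_buf);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```

Passing device pointers directly, rather than copying to host buffers around each collective, is what lets the library choose GDR transfers and kernel-based computation internally; the application code stays identical to the host-only version apart from where the buffers live.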
