
CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing


Abstract

Accelerators such as NVIDIA GPUs have changed the landscape of current HPC clusters to a great extent. The massive heterogeneous parallelism offered by these accelerators has led to GPU-aware MPI libraries that are widely used for writing distributed parallel scientific applications. Compute-oriented collective operations like MPI_Reduce perform computation on data in addition to the usual communication performed by collectives. Historically, due to their compute requirements, these collectives have been implemented only on the CPU (or host). However, with the advent of GPU technologies, it has become important for MPI libraries to provide better designs for their GPU (or device) based versions. In this paper, we tackle the above challenges and provide designs and implementations of the most commonly used compute-oriented collectives (MPI_Reduce, MPI_Allreduce, and MPI_Scan) for GPU clusters. We propose extensions to state-of-the-art algorithms that take full advantage of GPU capabilities such as GPUDirect RDMA (GDR) and CUDA compute kernels to perform these operations efficiently. With our new designs, we report reduced execution time for all compute-based collectives on up to 96 GPUs. Experimental results show improvements of 50% for small messages and 85% for large messages using MPI_Reduce. For MPI_Allreduce and MPI_Scan, we report a more than 40% reduction in execution time for large messages. Furthermore, analytical models are developed and evaluated to understand and predict the performance of the proposed designs on extremely large-scale GPU clusters.
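To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's actual implementation) of the two ingredients such designs combine: moving device buffers directly through MPI calls, and performing the reduction arithmetic with a CUDA kernel on the GPU rather than staging data through the host. It assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR or Open MPI built with CUDA support) that accepts device pointers; the flat receive-and-combine loop at the root stands in for the tree-based schedules a real design would use, and the buffer names and sizes are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)  /* elements contributed by each rank */

/* Elementwise combine on the device: the compute half of MPI_Reduce
 * with MPI_SUM, performed by a CUDA kernel instead of the host CPU. */
__global__ void elementwise_sum(const float *in, float *inout, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        inout[i] += in[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *d_local, *d_tmp;
    cudaMalloc(&d_local, N * sizeof(float));
    cudaMalloc(&d_tmp,   N * sizeof(float));
    /* ... fill d_local with this rank's contribution ... */

    if (rank == 0) {
        /* Flat reduce at the root: receive each peer's device buffer
         * directly (over GPUDirect RDMA when available) and fold it
         * into d_local on the GPU. Real designs use tree schedules. */
        for (int src = 1; src < size; src++) {
            MPI_Recv(d_tmp, N, MPI_FLOAT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            int threads = 256, blocks = (N + threads - 1) / threads;
            elementwise_sum<<<blocks, threads>>>(d_tmp, d_local, N);
            /* Finish combining before the next receive reuses d_tmp. */
            cudaDeviceSynchronize();
        }
        /* d_local now holds the reduced vector on the root's GPU. */
    } else {
        MPI_Send(d_local, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }

    cudaFree(d_local);
    cudaFree(d_tmp);
    MPI_Finalize();
    return 0;
}

With a GPU-aware library, the manual loop above collapses to a single call on device pointers, e.g. MPI_Reduce(d_local, d_result, N, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD). At scale, designs of this kind are commonly analyzed with alpha-beta style cost models, whose generic form for a binomial-tree reduce of an n-byte message over p ranks is roughly ceil(log2 p) * (alpha + n*beta + n*gamma), with gamma capturing the per-byte reduction cost; the paper develops and validates its own analytical models for the proposed designs.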