IEEE International Parallel and Distributed Processing Symposium

Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-Enabled Systems

Abstract

GPU accelerators are widely used in HPC clusters due to their massive parallelism and high throughput-per-watt. Data movement continues to be the major bottleneck on GPU clusters, even more so when the data is non-contiguous, as is common in scientific applications. CUDA-Aware MPI libraries optimize non-contiguous data movement with latency-oriented techniques, such as GPU kernels that accelerate the packing/unpacking operations. Although these techniques optimize the latency of a single operation, the inherent restrictions of the designs limit their efficiency for throughput-oriented patterns. Indeed, none of the existing designs fully exploits the massive parallelism of GPUs to provide high throughput and efficient resource utilization by enabling maximal overlap. In this paper, we propose novel designs for CUDA-Aware MPI libraries that achieve efficient GPU resource utilization and maximal overlap between CPUs and GPUs for non-contiguous data processing and movement. The proposed designs take advantage of several CUDA features, such as Hyper-Q/multi-streams and callback functions, to deliver high performance and efficiency. To the best of our knowledge, this is the first study to provide high throughput and efficient resource utilization for non-contiguous MPI data processing and movement to/from GPUs. A performance evaluation of the proposed designs using DDTBench shows up to 54%, 67%, and 61% performance improvement on the SPECFEM3D_oc, SPECFEM3D_cm, and WRF_y_sa benchmarks, respectively, for intra-node inter-GPU ping-pong experiments. The proposed designs also deliver up to 33% improvement in total execution time over existing designs for a HaloExchange-based application kernel that models the communication pattern of the MeteoSwiss weather forecasting model, run over 32 GPU nodes on the Wilkes GPU cluster.
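The overlap technique the abstract describes can be illustrated with a small sketch: non-contiguous data is packed chunk by chunk with GPU kernels issued on separate CUDA streams (run concurrently via Hyper-Q), and a stream callback notifies the CPU as each chunk is staged so communication can start while later chunks are still being packed. This is a minimal, hypothetical illustration, not the paper's implementation (the actual designs live inside the MPI library, and with GPUDirect RDMA the host staging copy shown here may be unnecessary); names such as pack_column and chunk_ready are invented for this example.

```cuda
// Sketch of multi-stream packing with callback-driven CPU/GPU overlap.
// Illustrative only; error checking omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

#define NSTREAMS 4

// Pack stride-separated elements (e.g., a column of a row-major 2-D
// field, the classic MPI vector datatype) into a contiguous buffer.
__global__ void pack_column(const double *src, double *dst,
                            int count, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        dst[i] = src[(size_t)i * stride];
}

// Fires on the CPU once this stream's pack kernel and copy are done;
// inside an MPI library this is where the send for the chunk would be
// posted, overlapping communication with the packing of later chunks.
void CUDART_CB chunk_ready(cudaStream_t, cudaError_t, void *arg)
{
    printf("chunk %d staged; CPU can post its send now\n",
           (int)(size_t)arg);
}

int main(void)
{
    const int count = 1 << 18, stride = 64;
    const int per_chunk = count / NSTREAMS;

    double *d_src, *d_stage, *h_stage;
    cudaMalloc(&d_src, (size_t)count * stride * sizeof(double));
    cudaMalloc(&d_stage, (size_t)count * sizeof(double));
    cudaMallocHost(&h_stage, (size_t)count * sizeof(double)); // pinned

    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&s[i]);

    // Pipeline the chunks: each stream packs, copies, then signals,
    // so pack kernels, D2H copies, and CPU-side sends all overlap.
    for (int c = 0; c < NSTREAMS; ++c) {
        size_t off = (size_t)c * per_chunk;
        pack_column<<<(per_chunk + 255) / 256, 256, 0, s[c]>>>(
            d_src + off * stride, d_stage + off, per_chunk, stride);
        cudaMemcpyAsync(h_stage + off, d_stage + off,
                        per_chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, s[c]);
        cudaStreamAddCallback(s[c], chunk_ready,
                              (void *)(size_t)c, 0);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d_src); cudaFree(d_stage); cudaFreeHost(h_stage);
    return 0;
}
```

A latency-oriented design would instead pack the whole message with one kernel on one stream and block until it completes; the chunked, multi-stream pipeline above is what allows throughput-oriented patterns to keep both the GPU and the interconnect busy at once.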
