IEEE Transactions on Parallel and Distributed Systems

Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks


Abstract

GPU-based computing systems have become a widely accepted solution in the high-performance computing (HPC) domain. GPUs offer highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution. On one hand, the workload should be distributed evenly among the GPUs. On the other hand, communication between GPU devices is costly and should be minimized. Therefore, the trade-off between work-distribution schemes and communication overheads conditions the overall performance of parallel applications run on multi-GPU systems. In this article we present a multi-GPU implementation of the NAS Multi-Zone Parallel Benchmarks, whose execution alternates communication and computational phases. We propose several work-distribution strategies that try to distribute the workload evenly among the GPUs. Our evaluations show that performance is highly sensitive to the distribution strategy, as the communication phases of the applications are heavily affected by the work-distribution schemes applied in the computational phases. In particular, we consider Static, Dynamic, and Guided schedulers to find a trade-off between both phases that maximizes overall performance. In addition, we compare those schedulers with an optimal schedule computed offline using IBM CPLEX. On an evaluation environment composed of 2x IBM Power9 8335-GTH and 4x NVIDIA V100 (Volta) GPUs, our multi-GPU parallelization outperforms single-GPU execution by 1.48x to 1.86x (2 GPUs) and by 1.75x to 3.54x (4 GPUs). This article analyses these improvements in terms of the relationship between the computational and communication phases of the applications as the number of GPUs increases. We show that Guided schedulers perform at a level similar to that of the optimal schedulers.
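The load-balancing problem the abstract describes can be sketched as follows. This is an illustrative example only, not the paper's actual implementation: the multi-zone benchmarks have zones of unequal sizes, and a scheduler must map zones to GPUs so that per-GPU work is balanced. A Static assignment maps zones round-robin regardless of size, while a Guided-style greedy assignment (largest remaining zone to the least-loaded GPU) adapts to uneven zone sizes:

```python
def static_schedule(zone_sizes, n_gpus):
    """Static round-robin: zone i is assigned to GPU i % n_gpus,
    ignoring zone sizes."""
    return [i % n_gpus for i in range(len(zone_sizes))]

def guided_schedule(zone_sizes, n_gpus):
    """Greedy size-aware heuristic (longest-processing-time style):
    visit zones from largest to smallest, assigning each to the
    currently least-loaded GPU."""
    loads = [0] * n_gpus
    assignment = [0] * len(zone_sizes)
    for i in sorted(range(len(zone_sizes)), key=lambda i: -zone_sizes[i]):
        g = loads.index(min(loads))  # least-loaded GPU so far
        assignment[i] = g
        loads[g] += zone_sizes[i]
    return assignment

def imbalance(zone_sizes, assignment, n_gpus):
    """Load imbalance factor: max per-GPU load over mean load.
    1.0 means a perfectly balanced distribution."""
    loads = [0] * n_gpus
    for i, g in enumerate(assignment):
        loads[g] += zone_sizes[i]
    return max(loads) / (sum(loads) / n_gpus)
```

For example, with zone sizes `[8, 1, 7, 2, 6, 3, 5, 4]` on 2 GPUs, the static schedule piles the large zones onto one GPU (imbalance ≈ 1.44), while the greedy schedule balances both GPUs at 18 units each (imbalance 1.0). The paper's point is that the best such trade-off also depends on the communication phases, which is why it compares these heuristics against an offline optimum computed with IBM CPLEX.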
