This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters equipped with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak memory bandwidth on an NVIDIA Tesla C1060 GPU and sustains over 50 GFlops. A multi-GPU implementation that combines MPI with CUDA streams to overlap GPU execution with data transfers scales linearly and delivers over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
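The overlap of GPU execution with data transfers mentioned above is the standard CUDA-streams pattern. A minimal sketch is given below; it is not the paper's actual code, and names such as `jacobi_interior`, `dev_boundary_plane`, and the MPI arguments are illustrative assumptions.

```cuda
// Hypothetical sketch of compute/transfer overlap with CUDA streams.
// Assumes host_halo was allocated with cudaMallocHost (pinned memory),
// which is required for cudaMemcpyAsync to overlap with kernel execution.
cudaStream_t compute_stream, copy_stream;
cudaStreamCreate(&compute_stream);
cudaStreamCreate(&copy_stream);

// Update the interior points on one stream while the boundary plane
// needed by the neighbouring MPI rank is copied out on another.
jacobi_interior<<<grid, block, 0, compute_stream>>>(p_new, p_old, nx, ny, nz);
cudaMemcpyAsync(host_halo, dev_boundary_plane, halo_bytes,
                cudaMemcpyDeviceToHost, copy_stream);

// Once the halo copy has finished, exchange it with the neighbour via MPI,
// then wait for the interior computation before the next iteration.
cudaStreamSynchronize(copy_stream);
MPI_Sendrecv(host_halo, halo_count, MPI_FLOAT, neighbour, 0,
             recv_halo, halo_count, MPI_FLOAT, neighbour, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
cudaStreamSynchronize(compute_stream);
```

Because the interior update does not depend on the halo being exchanged, the kernel and the device-to-host copy can proceed concurrently, hiding most of the communication cost behind computation.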