Large-Scale FFT on GPU Clusters

Abstract

A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g. matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g. finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, as the bottleneck often lies in the PCI link between main memory and GPU device memory or in the communication network between cluster nodes; optimizing FFT performance for a single GPU device therefore does not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, using GPU devices improves the sustained memory bandwidth for processing large data; second, GPU device memory allows larger subtasks to be processed as a whole, reducing repeated data transfers between memory and processors; and third, costly main-memory operations such as matrix transposition can be significantly sped up by GPUs if the necessary data adjustment is performed during the data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors, together with the improved communication library in our implementation, contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for a 4096³ 3D single-precision FFT on a 16-node cluster with 32 GPUs. For double precision, a speedup of around 5x with respect to both standard libraries is achieved.
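The abstract's decomposition rests on the standard observation that a 3D FFT factors into stages of batched 1D FFTs separated by a global transpose; on a cluster the transpose is the all-to-all exchange whose cost the paper's transfer-time data adjustment hides. A minimal single-process NumPy sketch of that factorization (the `transpose` step standing in for the inter-node exchange; all names here are illustrative, not from the paper):

```python
import numpy as np

def fft3d_by_stages(x):
    """Compute a 3D FFT as batched 1D FFTs plus a transpose,
    mirroring the slab-style decomposition used on clusters."""
    # Stage 1: batched 1D FFTs along the two "node-local" axes.
    y = np.fft.fft(np.fft.fft(x, axis=2), axis=1)
    # Stage 2: global transpose. In the distributed setting this is
    # the all-to-all exchange between nodes; here a local axis swap
    # (with an explicit copy, as real data movement would require)
    # stands in for it.
    y = y.transpose(2, 1, 0).copy()
    # Stage 3: batched 1D FFTs along what is now the last axis,
    # i.e. the original axis 0.
    y = np.fft.fft(y, axis=2)
    # Undo the transpose so the layout matches np.fft.fftn.
    return y.transpose(2, 1, 0)
```

The result agrees with `np.fft.fftn`; the point of the sketch is that the only non-local step is the transpose, which is why folding its data rearrangement into transfers that must happen anyway, as the paper proposes, attacks the actual bottleneck.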
