Large-Scale FFT on GPU Clusters

Abstract

A GPU cluster is a cluster equipped with GPU devices. Excellent acceleration is achievable for computation-intensive tasks (e.g. matrix multiplication and LINPACK) and for bandwidth-intensive tasks with data locality (e.g. finite-difference simulation). Bandwidth-intensive tasks without data locality, such as large-scale FFTs, are harder to accelerate, as the bottleneck often lies in the PCI link between main memory and GPU device memory or in the communication network between cluster nodes; optimizing FFT performance for a single GPU device therefore does not improve overall performance. This paper uses large-scale FFT as an example to show how to achieve substantial speedups for these more challenging tasks on a GPU cluster. Three GPU-related factors lead to better performance: first, using GPU devices improves the sustained memory bandwidth for processing large data; second, GPU device memory allows larger subtasks to be processed as a whole, reducing repeated data transfers between memory and processors; and third, costly main-memory operations such as matrix transposition can be significantly sped up by GPUs if the necessary data adjustment is performed during the data transfers. This technique of manipulating array dimensions during data transfer is the main technical contribution of this paper. These factors, together with the improved communication library in our implementation, contribute to a 24.3x speedup with respect to FFTW and a 7x speedup with respect to Intel MKL for a 4096³ 3D single-precision FFT on a 16-node cluster with 32 GPUs. For double precision, a speedup of around 5x with respect to both standard libraries is achieved.
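The abstract's decomposition rests on the standard observation that a 3D FFT factors into stages of batched 1D FFTs separated by a global transpose; on a cluster the transpose is the all-to-all exchange whose cost the paper's transfer-time data adjustment hides. A minimal single-process NumPy sketch of that factorization (the `transpose` step standing in for the inter-node exchange; all names here are illustrative, not from the paper):

```python
import numpy as np

def fft3d_by_stages(x):
    """Compute a 3D FFT as batched 1D FFTs plus a transpose,
    mirroring the slab-style decomposition used on clusters."""
    # Stage 1: batched 1D FFTs along the two "node-local" axes.
    y = np.fft.fft(np.fft.fft(x, axis=2), axis=1)
    # Stage 2: global transpose. In the distributed setting this is
    # the all-to-all exchange between nodes; here a local axis swap
    # (with an explicit copy, as real data movement would require)
    # stands in for it.
    y = y.transpose(2, 1, 0).copy()
    # Stage 3: batched 1D FFTs along what is now the last axis,
    # i.e. the original axis 0.
    y = np.fft.fft(y, axis=2)
    # Undo the transpose so the layout matches np.fft.fftn.
    return y.transpose(2, 1, 0)
```

The result agrees with `np.fft.fftn`; the point of the sketch is that the only non-local step is the transpose, which is why folding its data rearrangement into transfers that must happen anyway, as the paper proposes, attacks the actual bottleneck.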
