24th ACM International Conference on Supercomputing, 2010

An Empirically Tuned 2D and 3D FFT Library on CUDA GPU


Abstract

In this paper, a multidimensional FFT computation framework for GPUs, based on the Cooley-Tukey algorithm, is proposed. The framework generalizes the decomposition of multidimensional FFTs on GPUs using an I/O tensor representation, and therefore provides a systematic description of possible FFT implementations on GPUs. It is geared toward the efficiency of multidimensional FFTs on GPU architectures. In particular, no global transposition among dimensions is performed, and some previously unnoticed grouping and commutability of multiple dimensions are highlighted in order to reduce the number of computational kernels and minimize the number of global memory accesses. Important architectural factors and constraints of CUDA, such as coalesced access, bank conflicts, and register pressure, are also considered in this framework. Moreover, we adapt codelets, a straight-line style of FFT implementation originally developed in FFTW, into our framework and prove that they are highly efficient on GPUs.

A 2D and 3D FFT library, currently supporting power-of-two sizes, is implemented on this framework, and empirically tuned results are compared with CUFFT and other recent publications on three NVIDIA GPUs. On a high-end NVIDIA GPU, the GeForce GTX280, our 2D implementation is on average 2.8x faster than CUFFT and 1.6x faster than the best previously published results. Our 3D FFT implementation achieves a 22.7x speedup over CUFFT on average. Furthermore, both implementations show better precision than CUFFT. This library and its framework are potentially extensible to more general FFT problem sizes and to other parallel architectures as well.
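As a rough illustration of two ideas named in the abstract, the sketch below shows a radix-2 Cooley-Tukey decomposition and a 2D transform that handles the second dimension through strided access rather than a transposed copy. It is plain CPU Python, not the paper's CUDA implementation; the function names `fft1d` and `fft2d` and the flat row-major buffer layout are assumptions made for this example only.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative, not the paper's code).

    len(x) must be a power of two.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed halves and transform each recursively.
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor for the butterfly that combines the two halves.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2d(data, rows, cols):
    """2D FFT over a flat row-major buffer (hypothetical layout).

    Each dimension is transformed where it lies in the buffer: rows with
    stride 1, columns with stride `cols`. No transposed copy of the whole
    array is built.
    """
    if len(data) != rows * cols:
        raise ValueError("buffer size does not match rows * cols")
    buf = [complex(v) for v in data]
    # Dimension 1: contiguous rows.
    for r in range(rows):
        start = r * cols
        buf[start:start + cols] = fft1d(buf[start:start + cols])
    # Dimension 2: columns, read and written with stride `cols`.
    for c in range(cols):
        buf[c::cols] = fft1d(buf[c::cols])
    return buf
```

For example, `fft2d([1, 0, 0, 0, 0, 0, 0, 0], 2, 4)` transforms a 2-by-4 unit impulse, whose spectrum is constant. On a GPU the strided column pass is exactly where the coalescing and bank-conflict concerns mentioned in the abstract would arise; this sketch does not model them.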
