首页> 外文会议>IEEE International Parallel and Distributed Processing Symposium >cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs
【24h】

cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs

机译:Cusfft:GPU上的高性能稀疏快速傅里叶变换算法

获取原文

摘要

The Fast Fourier Transform (FFT) is one of the most important numerical tools widely used in many scientific and engineering applications. The algorithm performs O(nlogn) operations on n input data points in order to calculate only small number of k large coefficients, while the rest of n - k numbers are zero or negligibly small. The algorithm is clearly inefficient, when n points input data lead to only k <;Z n non-zero coefficients in the transformed domain. MIT in 2012 developed a sparse FFT (sFFT) algorithm that provides a solution to this problem. In this paper, we explore the challenges and propose effective solutions to efficiently port sFFT to massively parallel processors, such as GPUs, using CUDA. GPGPUs are being increasingly adopted as popular HPC platforms because of their tremendous computing power and remarkable cost efficiency. However, sFFT algorithm is a complex and computationally challenging memory-bound algorithm that is not straightforward to be implemented on GPUs. In this paper, we present some of the optimization strategies such as index coalescing, loop splitting, asynchronous data layout transformation, linear time selection algorithm that are required to compute sFFT on such massively parallel architectures. Our CUDA-based sFFT, cusFFT, performs over 10x faster than the state-of-the-art cuFFT library on GPUs and over 28x faster than the parallel FFTW on multicore CPUs.
机译:快速的傅里叶变换(FFT)是许多科学和工程应用中广泛应用于最重要的数值工具之一。该算法在n个输入数据点上执行O(nlogn)操作,以便仅计算少量k大系数,而N-k编号的其余部分是零或忽略的小。当N点输入数据导致变换域中的k <; z n非零系数时,该算法显然是低效的。 2012年的麻省理工学院开发了一种稀疏的FFT(SFFT)算法,为此问题提供了解决方案。在本文中,我们探讨了挑战,并提出了有效的解决方案,以便使用CUDA将SFFT与大规模平行处理器(如GPU)有效地端口SFFT。由于其巨大的计算能力和显着的成本效率,GPGPU越来越多地被作为流行的HPC平台采用。然而,SFFT算法是一种复杂的和计算挑战的内存绑定算法,其在GPU上不必直接实现。在本文中,我们介绍了一些优化策略,如索引聚结,环形分裂,异步数据布局转换,在这种大规模并行架构上计算SFFT所需的线性时间选择算法。我们的CUDA系列Cusfft,Cusfft,比GPU上的最先进的袖扣库更快地执行超过10倍,而不是多核CPU上的并行FFTW速度超过28倍。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号