首页> 外文学位 >Parallel FFT program optimization on heterogeneous computers.
【24h】

Parallel FFT program optimization on heterogeneous computers.

机译:异构计算机上的并行FFT程序优化。

获取原文
获取原文并翻译 | 示例

摘要

Generating high performance Fast Fourier Transform (FFT) library is an important research topic for the traditional processors, CPUs, and new accelerators, like Graphics Processing Units (GPUs). It is not rare that large scientific and engineering computation, such as physics simulations, signal processing and data compression, spend majority of execution time on large size FFTs. Such FFT implementations require large amount of computing resources and memory bandwidth.;On the system side, in spite of highly influential results in prior FFT work on GPUs, the GPU performance is severely restricted by the limited memory size and the low bandwidth of data transfer through PCI channel. Additionally, current GPU based FFT implementation only uses GPU to compute, but employs CPU as a mere memory-transfer controller. The computing power of CPUs is wasted. On the algorithmic side, input signals are frequently sparse. If we know that an input is sparse, the computational complexity of FFT can be reduced. Many sparse FFT algorithms have been proposed to improve sparse FFT's efficiency. However, the existing sparse FFT implementations are confined to serial execution and are input oblivious in the sense that how the algorithms work is not affected by input characteristics.;In this dissertation, we present two high performance optimization strategies. First, we study the problems of current GPU based FFT implementations, and propose a hybrid approach for 2D and 3D FFT, which concurrently executes both multithreaded CPU and GPU in a heterogeneous computer to accelerate large FFT problems that cannot fit into GPU memory. Within the scheme, an empirical performance modeling is constructed to determine optimal load balancing between CPU and GPU, and an optimizer is proposed to exploit substantial parallelism for both GPU and CPUs and to overlap communication with computation. Second, we investigate the existing sparse FFT algorithms and propose an input adaptive model for algorithmic parallelization. In particular, the algorithm takes advantage of the similarity between input samples to save much computation and to exploit substantial data parallelism. The solution has runtime sub-linear to the input size and gets rid of coefficient estimation's dependencies, both of which improve parallelism and performance.
机译:对于传统处理器,CPU和诸如图形处理单元(GPU)之类的新加速器而言,生成高性能快速傅立叶变换(FFT)库是一个重要的研究课题。大型科学和工程计算(例如物理模拟,信号处理和数据压缩)在大型FFT上花费大部分执行时间并不罕见。这样的FFT实现需要大量的计算资源和内存带宽。;在系统方面,尽管先前FFT在GPU上的工作产生了很大的影响,但由于内存大小有限和数据传输的低带宽而严重限制了GPU性能通过PCI通道。此外,当前基于GPU的FFT实现仅使用GPU进行计算,但仅将CPU用作内存传输控制器。 CPU的计算能力被浪费了。在算法方面,输入信号经常是稀疏的。如果我们知道输入稀疏,则可以降低FFT的计算复杂度。已经提出了许多稀疏FFT算法来提高稀疏FFT的效率。但是,现有的稀疏FFT实现方式仅限于串行执行,并且在算法工作不受输入特性影响的意义上是忽略输入的。本文提出了两种高性能的优化策略。首先,我们研究当前基于GPU的FFT实现的问题,并提出一种用于2D和3D FFT的混合方法,该方法在异构计算机中同时执行多线程CPU和GPU,以加速无法容纳在GPU内存中的大型FFT问题。在该方案中,构建了一个经验性能模型来确定CPU和GPU之间的最佳负载平衡,并提出了一个优化程序来利用GPU和CPU的实质并行性并使通信与计算重叠。其次,我们研究现有的稀疏FFT算法,并提出用于算法并行化的输入自适应模型。特别地,该算法利用输入样本之间的相似性来节省大量计算并充分利用数据并行性。该解决方案具有与输入大小成线性关系的运行时,并且摆脱了系数估计的依赖关系,两者均提高了并行性和性能。

著录项

  • 作者

    Chen, Shuo.;

  • 作者单位

    University of Delaware.;

  • 授予单位 University of Delaware.;
  • 学科 Computer engineering.
  • 学位 Ph.D.
  • 年度 2015
  • 页码 152 p.
  • 总页数 152
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号