Parallel FFT program optimization on heterogeneous computers.

Abstract

Generating high-performance Fast Fourier Transform (FFT) libraries is an important research topic for both traditional processors (CPUs) and newer accelerators such as Graphics Processing Units (GPUs). Large scientific and engineering computations, such as physics simulations, signal processing, and data compression, often spend the majority of their execution time on large FFTs. Such FFT implementations require large amounts of computing resources and memory bandwidth.

On the system side, despite highly influential results in prior FFT work on GPUs, GPU performance is severely restricted by limited memory size and by the low bandwidth of data transfer over the PCI channel. In addition, current GPU-based FFT implementations use only the GPU for computation and employ the CPU as a mere memory-transfer controller, so the computing power of the CPU is wasted. On the algorithmic side, input signals are frequently sparse. If an input is known to be sparse, the computational complexity of the FFT can be reduced, and many sparse FFT algorithms have been proposed to exploit this. However, existing sparse FFT implementations are confined to serial execution and are input-oblivious, in the sense that how the algorithms work is not affected by input characteristics.

In this dissertation, we present two high-performance optimization strategies. First, we study the problems of current GPU-based FFT implementations and propose a hybrid approach for 2D and 3D FFTs that runs the multithreaded CPU and the GPU of a heterogeneous computer concurrently, to accelerate large FFT problems that cannot fit into GPU memory. Within this scheme, an empirical performance model is constructed to determine the optimal load balance between CPU and GPU, and an optimizer is proposed to exploit substantial parallelism on both the GPU and the CPUs and to overlap communication with computation.

Second, we investigate existing sparse FFT algorithms and propose an input-adaptive model for algorithmic parallelization. In particular, the algorithm takes advantage of the similarity between input samples to save much computation and to exploit substantial data parallelism. The solution has a runtime sublinear in the input size and removes the dependencies in coefficient estimation, both of which improve parallelism and performance.
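
As a rough illustration of the hybrid idea in the abstract (not the dissertation's implementation), the sketch below computes a 2D FFT by row-column decomposition and splits the rows of each pass between two workers in proportion to their measured throughput. Everything here is an assumption made for illustration: the names `split_rows`, `fft_rows` and `hybrid_fft2` are hypothetical, the two Python threads merely stand in for the CPU and the GPU, and the throughput figures would in practice come from an empirical performance model rather than being passed in by hand.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def split_rows(n_rows, rate_a, rate_b):
    """Rows for worker A so that both workers finish at about the same time.

    rate_a and rate_b are measured throughputs (rows per second); these
    are hypothetical inputs standing in for a performance model.
    """
    rows_a = round(n_rows * rate_a / (rate_a + rate_b))
    return rows_a, n_rows - rows_a


def fft_rows(rows):
    """Apply a 1D FFT to every row of a list of rows."""
    return [fft(r) for r in rows]


def hybrid_fft2(matrix, rate_a, rate_b):
    """2D FFT via row-column decomposition, each pass split across two workers."""
    def one_pass(rows):
        rows_a, _ = split_rows(len(rows), rate_a, rate_b)
        with ThreadPoolExecutor(max_workers=2) as pool:
            part_a = pool.submit(fft_rows, rows[:rows_a])   # stand-in for the CPU share
            part_b = pool.submit(fft_rows, rows[rows_a:])   # stand-in for the GPU share
            return part_a.result() + part_b.result()

    step = one_pass(matrix)                   # 1D FFTs along the rows
    step = [list(c) for c in zip(*step)]      # transpose
    step = one_pass(step)                     # 1D FFTs along the original columns
    return [list(c) for c in zip(*step)]      # transpose back
```

Because of Python's global interpreter lock the two threads do not give a real speed-up here; the sketch only shows how a throughput-proportional split partitions the independent 1D FFTs of each pass.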

Bibliographic record

Author: Chen, Shuo
Affiliation: University of Delaware
Degree-granting institution: University of Delaware
Subject: Computer engineering
Degree: Ph.D.
Year: 2015
Pages: 152
Original format: PDF
Language: English

Similar literature

1. Energy optimization of parallel programs in a heterogeneous system by combining processor core-shutdown and dynamic voltage scaling [J]. Zhuowei Wang, Hao Wang, Wuqing Zhao. Future Generation Computer Systems. 2019, No. MARa.
2. Optimizing Process Allocation Of Parallel Programs For Heterogeneous Clusters [J]. Shuichi Ichikawa, Sho Takahashi, Yuu Kawai. Concurrency and Computation. 2009, No. 4.
3. Design optimisation of multiplier-free parallel pipelined FFT on field programmable gate array [J]. Godi Prasanna Kumar, Krishna Battula Tirumala, Kotipalli Pushpa. Circuits, Devices & Systems, IET. 2020, No. 7.
4. Optimizing ELARS Algorithms Using NVIDIA CUDA Heterogeneous Parallel Programming Platform [C]. Vedran Miletić, Martina Holenko Dlab, Nataša Hoić-Božić. ICT Innovations Conference. 2015.
5. A programming model and processor architecture for heterogeneous multicore computers [D]. Linderman, Michael David. 2009.
6. Parallelisation of equation-based simulation programs on heterogeneous computing systems [O]. Dragan D. Nikolić. 2018.
7. ParallelStructure: a R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers [O]. Francois Besnier, Kevin A Glover. 2013.
