Optimization techniques for mapping algorithms and applications onto CUDA GPU platforms and CPU-GPU heterogeneous platforms.

Abstract

An emerging trend in processor architecture seems to indicate a doubling of the number of cores per chip every two years, with the same or a decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms.

The fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto recent CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFTs.

We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for recent NVIDIA GPUs. In addition to massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-byte device memory transactions. As a result, we achieve up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480.

We further extend our methodology to CPU-GPU heterogeneous platforms for the case in which the input is too large to fit in the GPU global memory. We develop optimization techniques for memory-bound and computation-bound applications. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap these transfers with kernel execution as much as possible. For memory-bound applications, we achieve a near-peak effective PCIe bus bandwidth of 9-10 GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU software pipeline to a computation-bound application, DGEMM, and achieve the illusion of a memory the size of the CPU memory together with a computation throughput similar to that of a pure GPU computation.
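The benefit of overlapping CPU-GPU transfers with kernel execution can be illustrated with a simple timing model. The sketch below is not taken from the thesis. It assumes the input is split into equal-sized chunks, that each chunk passes through three stages (host-to-device copy, kernel, device-to-host copy), that the hardware can run one of each stage concurrently, and that enough device buffers are available. The function names and the stage times are hypothetical.

```python
def serial_time(n_chunks, h2d, kernel, d2h):
    """Total time when each chunk is uploaded, processed, and downloaded in turn."""
    return n_chunks * (h2d + kernel + d2h)


def pipelined_time(n_chunks, h2d, kernel, d2h):
    """Total time when the stages of different chunks overlap fully.

    Assumption: equal-sized chunks, one chunk per stage at a time, and an
    upload, a kernel, and a download can all run concurrently.
    """
    if n_chunks == 0:
        return 0
    # The first chunk pays the full latency of all three stages; every later
    # chunk finishes one bottleneck-stage interval after the previous one.
    bottleneck = max(h2d, kernel, d2h)
    return (h2d + kernel + d2h) + (n_chunks - 1) * bottleneck


# Hypothetical per-chunk stage times in microseconds.
memory_bound = dict(h2d=100, kernel=40, d2h=100)    # transfers dominate
compute_bound = dict(h2d=100, kernel=300, d2h=100)  # kernel dominates

for name, stages in [("memory-bound", memory_bound), ("compute-bound", compute_bound)]:
    print(name, serial_time(16, **stages), pipelined_time(16, **stages))
```

Under these assumptions, the pipelined time of a memory-bound workload approaches the time of the slowest transfer direction alone, which is why its throughput is capped by the PCIe bandwidth. For a compute-bound workload the transfers are hidden behind the kernel, so the throughput approaches that of a GPU-only run.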

Bibliographic record

  • Author

    Wu, Jing.;

  • Author affiliation

    University of Maryland, College Park.;

  • Degree-granting institution University of Maryland, College Park.;
  • Discipline Engineering Computer.;Computer Science.
  • Degree Ph.D.
  • Year 2014
  • Pages 179 p.
  • Total pages 179
  • Original format PDF
  • Text language eng
