
Exploiting Data-Parallelism in GPUs.



Abstract

Mainstream microprocessor design no longer delivers performance gains by raising the processor clock frequency, due to power and thermal constraints. Nonetheless, advances in semiconductor fabrication still allow transistor density to grow at the rate of Moore's law. The result has been a proliferation of many-core parallel architectures and accelerators, among which GPUs (graphics processing units) quickly established themselves as well suited to applications that exploit fine-grained data-parallelism. GPU clusters are making inroads into the HPC (high-performance computing) domain as well, thanks to much better performance per watt on floating-point operations than general-purpose processors such as CPUs.

Even though programming GPUs is easier than ever, using GPU resources efficiently requires techniques not found elsewhere. Traditional function-level task-parallelism can hardly provide enough optimization opportunities for such architectures. Instead, it is crucial to extract data-parallelism and map it onto the massive-threading execution model that GPUs advocate (a minimal sketch of this mapping follows the abstract).

This dissertation comprises several efforts to build programming models on top of the existing model (CUDA) for single GPUs as well as GPU clusters. We start by manually implementing a flocking-based document-clustering algorithm on GPU clusters. From this first-hand experience writing code directly on top of CUDA and MPI (message passing interface), we draw three key observations: (1) a unified memory interface greatly enhances programmability, especially in a GPU-cluster environment; (2) expressing data-parallelism explicitly at the language level eases the mapping of algorithms onto massively parallel architectures; and (3) auto-tuning is necessary to achieve competitive performance as parallel architectures grow more complex.

Based on these observations, we propose several programming models and compiler approaches that pursue portability and programmability while retaining as much performance as possible.

• We propose GStream, a general-purpose, scalable data-streaming framework on GPUs. It projects powerful yet concise language abstractions onto GPUs to fully exploit their inherent massive data-parallelism (a copy/compute-overlap sketch of this streaming style appears below).

• We take a domain-specific-language approach to provide an efficient implementation of 3D iterative stencil computations on GPUs, with auto-tuning capabilities (see the stencil kernel sketch below).

• We propose CuNesl, a compiler framework that translates and optimizes NESL, a nested data-parallel language, into parallel CUDA programs for SIMT architectures. By converting recursive calls into while loops, we ensure that the hierarchical execution model of GPUs can be exploited on the "flattened" code (see the flattening sketch below).

• Finally, we design HiDP, a hierarchical data-parallel language that matches the hierarchical features of modern microprocessor architectures, and develop a source-to-source compiler that converts HiDP into tunable CUDA C++ source code. It greatly improves coding productivity while keeping pace with the performance of hand-written CUDA code (see the hierarchy sketch below).

Together, these methods cover a wide range of GPGPU techniques and represent the current trend of exploiting data-parallelism on state-of-the-art GPUs.
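To make the massive-threading mapping concrete, here is a minimal CUDA sketch (our illustration, not code from the dissertation): each logical data element is assigned to one GPU thread. The use of today's unified memory API echoes observation (1), though it postdates this work.

```cuda
#include <cstdio>

// One logical element per thread: the basic data-parallel mapping.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    if (i < n)                                        // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged((void**)&x, n * sizeof(float)); // unified memory, in the
    cudaMallocManaged((void**)&y, n * sizeof(float)); // spirit of observation (1)
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);   // ~4096 blocks of 256 threads
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                      // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```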
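The abstract does not give GStream's API, so the following is only our sketch of the copy/compute overlap that a GPU data-streaming framework automates, written with plain CUDA streams and pinned host buffers; the `scale` kernel and the chunk layout are invented for illustration.

```cuda
#include <cstdio>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int chunks = 4, n = 1 << 20;
    float *h, *d;
    // Pinned host memory is required for truly asynchronous copies.
    cudaHostAlloc((void**)&h, chunks * n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d, chunks * n * sizeof(float));
    for (int i = 0; i < chunks * n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) {
        cudaStreamCreate(&s[c]);
        float *hc = h + c * n, *dc = d + c * n;
        // Copy-in, compute, and copy-out for chunk c run asynchronously in
        // stream s[c], overlapping with other chunks' transfers and kernels.
        cudaMemcpyAsync(dc, hc, n * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(n + 255) / 256, 256, 0, s[c]>>>(dc, n);
        cudaMemcpyAsync(hc, dc, n * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);  // expect 2.0
    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```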
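The stencil bullet is concrete enough for a worked example. Below is our sketch (not the DSL's generated code) of one Jacobi step of a 3D 7-point stencil, one thread per interior grid point; the block shape chosen in `main` is exactly the kind of parameter an auto-tuner would search over.

```cuda
#include <cstdio>

#define NX 128
#define NY 128
#define NZ 128
#define IDX(x, y, z) ((size_t)(z) * NY * NX + (size_t)(y) * NX + (x))

// One Jacobi step: each interior point becomes the average of itself
// and its six face neighbors.
__global__ void jacobi7(const float *in, float *out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x > 0 && x < NX - 1 && y > 0 && y < NY - 1 && z > 0 && z < NZ - 1)
        out[IDX(x, y, z)] =
            (in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)] +
             in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)] +
             in[IDX(x, y, z - 1)] + in[IDX(x, y, z + 1)] +
             in[IDX(x, y, z)]) / 7.0f;
}

int main() {
    size_t bytes = (size_t)NX * NY * NZ * sizeof(float);
    float *a, *b;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);
    dim3 block(32, 4, 2);                        // the auto-tuner's search knob
    dim3 grid(NX / block.x, NY / block.y, NZ / block.z);
    for (int t = 0; t < 10; ++t) {               // time-step loop
        jacobi7<<<grid, block>>>(a, b);
        float *tmp = a; a = b; b = tmp;          // double buffering
    }
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```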
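The CuNesl bullet names a specific transformation: recursive calls become while loops. Here is our illustration of that idea (not CuNesl output) on a tail-recursive function, gcd, rewritten iteratively so each SIMT thread can execute it without device-side recursion.

```cuda
#include <cstdio>

// Recursive form expressed in the nested data-parallel source:
//   gcd(a, 0) = a
//   gcd(a, b) = gcd(b, a mod b)
// After the transformation, the recursive call becomes a loop trip:
__device__ unsigned gcd_iter(unsigned a, unsigned b) {
    while (b != 0) {              // was: return gcd(b, a % b);
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

__global__ void pairwise_gcd(const unsigned *x, const unsigned *y,
                             unsigned *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gcd_iter(x[i], y[i]);    // data-parallel over pairs
}

int main() {
    const int n = 256;
    unsigned hx[n], hy[n], ho[n];
    for (int i = 0; i < n; ++i) { hx[i] = 12u * (i + 1); hy[i] = 18u; }
    unsigned *dx, *dy, *dout;
    cudaMalloc((void**)&dx, n * sizeof(unsigned));
    cudaMalloc((void**)&dy, n * sizeof(unsigned));
    cudaMalloc((void**)&dout, n * sizeof(unsigned));
    cudaMemcpy(dx, hx, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    pairwise_gcd<<<1, n>>>(dx, dy, dout, n);
    cudaMemcpy(ho, dout, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("gcd(12, 18) = %u\n", ho[0]);         // expect 6
    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}
```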
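Finally, a sketch of the hierarchical mapping HiDP targets (ours, not compiler-generated code): outer data-parallelism over rows maps to thread blocks, and inner parallelism within a row maps to the threads of a block, which cooperate through shared memory.

```cuda
#include <cstdio>

__global__ void row_sums(const float *m, float *sums, int ncols) {
    extern __shared__ float buf[];
    const float *row = m + (size_t)blockIdx.x * ncols; // one block per row (outer level)
    float acc = 0.0f;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x)
        acc += row[c];                                 // threads stride the row (inner level)
    buf[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction in shared memory
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) sums[blockIdx.x] = buf[0];
}

int main() {
    const int rows = 64, cols = 1024;
    float *m, *s;
    cudaMallocManaged((void**)&m, rows * cols * sizeof(float));
    cudaMallocManaged((void**)&s, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) m[i] = 1.0f;
    row_sums<<<rows, 256, 256 * sizeof(float)>>>(m, s, cols);
    cudaDeviceSynchronize();
    printf("row 0 sum = %f\n", s[0]);                  // expect 1024.0
    cudaFree(m); cudaFree(s);
    return 0;
}
```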

Bibliographic Details

  • Author: Zhang, Yongpeng
  • Affiliation: North Carolina State University
  • Degree grantor: North Carolina State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 141
  • Format: PDF
  • Language: English
