
Exploiting Data-Parallelism in GPUs.



Abstract

Mainstream microprocessor design no longer delivers performance gains by raising the processor clock frequency, due to power and thermal constraints. Nonetheless, advances in semiconductor fabrication still allow transistor density to grow at the rate of Moore's law. The result has been a proliferation of many-core parallel architectures and accelerators, among which GPUs (graphics processing units) quickly established themselves as well suited to applications that exploit fine-grained data-parallelism. GPU clusters are making inroads into the HPC (high-performance computing) domain as well, thanks to much better performance per watt on floating-point operations than general-purpose processors such as CPUs.

Even though programming GPUs is easier than ever, using GPU resources efficiently requires techniques not found elsewhere. Traditional function-level task-parallelism can hardly provide enough optimization opportunities for such architectures. Instead, it is crucial to extract data-parallelism and map it onto the massive-threading execution model that GPUs advocate (a minimal sketch of this mapping follows the abstract).

This dissertation comprises several efforts to build programming models on top of the existing model (CUDA) for single GPUs as well as GPU clusters. We start by manually implementing a flocking-based document-clustering algorithm on GPU clusters. From this first-hand experience writing code directly on top of CUDA and MPI (message passing interface), we draw three key observations: (1) a unified memory interface greatly enhances programmability, especially in a GPU-cluster environment; (2) expressing data-parallelism explicitly at the language level eases the mapping of algorithms onto massively parallel architectures; and (3) auto-tuning is necessary to achieve competitive performance as parallel architectures grow more complex.

Based on these observations, we propose several programming models and compiler approaches that pursue portability and programmability while retaining as much performance as possible.

• We propose GStream, a general-purpose, scalable data-streaming framework on GPUs. It projects powerful yet concise language abstractions onto GPUs to fully exploit their inherent massive data-parallelism (a copy/compute-overlap sketch of this streaming style appears below).

• We take a domain-specific-language approach to provide an efficient implementation of 3D iterative stencil computations on GPUs, with auto-tuning capabilities (see the stencil kernel sketch below).

• We propose CuNesl, a compiler framework that translates and optimizes NESL, a nested data-parallel language, into parallel CUDA programs for SIMT architectures. By converting recursive calls into while loops, we ensure that the hierarchical execution model of GPUs can be exploited on the "flattened" code (see the flattening sketch below).

• Finally, we design HiDP, a hierarchical data-parallel language that matches the hierarchical features of modern microprocessor architectures, and develop a source-to-source compiler that converts HiDP into tunable CUDA C++ source code. It greatly improves coding productivity while keeping pace with the performance of hand-written CUDA code (see the hierarchy sketch below).

Together, these methods cover a wide range of GPGPU techniques and represent the current trend of exploiting data-parallelism on state-of-the-art GPUs.
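To make the massive-threading mapping concrete, here is a minimal CUDA sketch (our illustration, not code from the dissertation): each logical data element is assigned to one GPU thread. The use of today's unified memory API echoes observation (1), though it postdates this work.

```cuda
#include <cstdio>

// One logical element per thread: the basic data-parallel mapping.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    if (i < n)                                        // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged((void**)&x, n * sizeof(float)); // unified memory, in the
    cudaMallocManaged((void**)&y, n * sizeof(float)); // spirit of observation (1)
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);   // ~4096 blocks of 256 threads
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                      // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```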
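The abstract does not give GStream's API, so the following is only our sketch of the copy/compute overlap that a GPU data-streaming framework automates, written with plain CUDA streams and pinned host buffers; the `scale` kernel and the chunk layout are invented for illustration.

```cuda
#include <cstdio>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int chunks = 4, n = 1 << 20;
    float *h, *d;
    // Pinned host memory is required for truly asynchronous copies.
    cudaHostAlloc((void**)&h, chunks * n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d, chunks * n * sizeof(float));
    for (int i = 0; i < chunks * n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) {
        cudaStreamCreate(&s[c]);
        float *hc = h + c * n, *dc = d + c * n;
        // Copy-in, compute, and copy-out for chunk c run asynchronously in
        // stream s[c], overlapping with other chunks' transfers and kernels.
        cudaMemcpyAsync(dc, hc, n * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(n + 255) / 256, 256, 0, s[c]>>>(dc, n);
        cudaMemcpyAsync(hc, dc, n * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);  // expect 2.0
    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```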
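The stencil bullet is concrete enough for a worked example. Below is our sketch (not the DSL's generated code) of one Jacobi step of a 3D 7-point stencil, one thread per interior grid point; the block shape chosen in `main` is exactly the kind of parameter an auto-tuner would search over.

```cuda
#include <cstdio>

#define NX 128
#define NY 128
#define NZ 128
#define IDX(x, y, z) ((size_t)(z) * NY * NX + (size_t)(y) * NX + (x))

// One Jacobi step: each interior point becomes the average of itself
// and its six face neighbors.
__global__ void jacobi7(const float *in, float *out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x > 0 && x < NX - 1 && y > 0 && y < NY - 1 && z > 0 && z < NZ - 1)
        out[IDX(x, y, z)] =
            (in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)] +
             in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)] +
             in[IDX(x, y, z - 1)] + in[IDX(x, y, z + 1)] +
             in[IDX(x, y, z)]) / 7.0f;
}

int main() {
    size_t bytes = (size_t)NX * NY * NZ * sizeof(float);
    float *a, *b;
    cudaMalloc((void**)&a, bytes);
    cudaMalloc((void**)&b, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);
    dim3 block(32, 4, 2);                        // the auto-tuner's search knob
    dim3 grid(NX / block.x, NY / block.y, NZ / block.z);
    for (int t = 0; t < 10; ++t) {               // time-step loop
        jacobi7<<<grid, block>>>(a, b);
        float *tmp = a; a = b; b = tmp;          // double buffering
    }
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```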
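The CuNesl bullet names a specific transformation: recursive calls become while loops. Here is our illustration of that idea (not CuNesl output) on a tail-recursive function, gcd, rewritten iteratively so each SIMT thread can execute it without device-side recursion.

```cuda
#include <cstdio>

// Recursive form expressed in the nested data-parallel source:
//   gcd(a, 0) = a
//   gcd(a, b) = gcd(b, a mod b)
// After the transformation, the recursive call becomes a loop trip:
__device__ unsigned gcd_iter(unsigned a, unsigned b) {
    while (b != 0) {              // was: return gcd(b, a % b);
        unsigned r = a % b;
        a = b;
        b = r;
    }
    return a;
}

__global__ void pairwise_gcd(const unsigned *x, const unsigned *y,
                             unsigned *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gcd_iter(x[i], y[i]);    // data-parallel over pairs
}

int main() {
    const int n = 256;
    unsigned hx[n], hy[n], ho[n];
    for (int i = 0; i < n; ++i) { hx[i] = 12u * (i + 1); hy[i] = 18u; }
    unsigned *dx, *dy, *dout;
    cudaMalloc((void**)&dx, n * sizeof(unsigned));
    cudaMalloc((void**)&dy, n * sizeof(unsigned));
    cudaMalloc((void**)&dout, n * sizeof(unsigned));
    cudaMemcpy(dx, hx, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(unsigned), cudaMemcpyHostToDevice);
    pairwise_gcd<<<1, n>>>(dx, dy, dout, n);
    cudaMemcpy(ho, dout, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("gcd(12, 18) = %u\n", ho[0]);         // expect 6
    cudaFree(dx); cudaFree(dy); cudaFree(dout);
    return 0;
}
```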
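Finally, a sketch of the hierarchical mapping HiDP targets (ours, not compiler-generated code): outer data-parallelism over rows maps to thread blocks, and inner parallelism within a row maps to the threads of a block, which cooperate through shared memory.

```cuda
#include <cstdio>

__global__ void row_sums(const float *m, float *sums, int ncols) {
    extern __shared__ float buf[];
    const float *row = m + (size_t)blockIdx.x * ncols; // one block per row (outer level)
    float acc = 0.0f;
    for (int c = threadIdx.x; c < ncols; c += blockDim.x)
        acc += row[c];                                 // threads stride the row (inner level)
    buf[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {     // tree reduction in shared memory
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) sums[blockIdx.x] = buf[0];
}

int main() {
    const int rows = 64, cols = 1024;
    float *m, *s;
    cudaMallocManaged((void**)&m, rows * cols * sizeof(float));
    cudaMallocManaged((void**)&s, rows * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) m[i] = 1.0f;
    row_sums<<<rows, 256, 256 * sizeof(float)>>>(m, s, cols);
    cudaDeviceSynchronize();
    printf("row 0 sum = %f\n", s[0]);                  // expect 1024.0
    cudaFree(m); cudaFree(s);
    return 0;
}
```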

Bibliographic Details

  • Author: Zhang, Yongpeng
  • Affiliation: North Carolina State University
  • Degree grantor: North Carolina State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2012
  • Pages: 141
  • Format: PDF
  • Language: English
