Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs

Jin Wang; Norm Rubin; Albert Sidelnik; Sudhakar Yalamanchili

Abstract

GPUs have proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning has met with less success. The recently introduced nested device-side kernel launch functionality in GPUs is a step in the right direction, but still falls short of effectively harnessing the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) that extends the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks, rather than kernels, to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
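The trade-off the abstract describes, where irregular work hurts a flat thread mapping and dynamic spawning helps only if each launch is cheap, can be illustrated with a toy cost model. The sketch below is not taken from the paper: the function names, the `threshold`, `block_size`, and `launch_overhead` parameters, the serial summation of warps, and the sample degree list are all assumptions chosen for illustration, and the model makes no attempt to reproduce the reported 1.21x or 1.40x figures.

```python
from typing import List

WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs


def flat_cost(degrees: List[int], warp_size: int = WARP_SIZE) -> int:
    """Toy cost of a flat mapping: one thread per vertex, each thread
    loops serially over its own neighbor list. A warp finishes only when
    its slowest thread does. Warps are summed as if they ran one after
    another (a simplifying assumption)."""
    total = 0
    for start in range(0, len(degrees), warp_size):
        total += max(degrees[start:start + warp_size])
    return total


def dynamic_cost(degrees: List[int], threshold: int, block_size: int,
                 launch_overhead: int, warp_size: int = WARP_SIZE) -> int:
    """Toy cost when a vertex with more than `threshold` neighbors spawns
    a group of `block_size` threads that split its neighbor list evenly.
    `launch_overhead` is a hypothetical per-launch cost: large values
    stand in for a heavyweight launch, small values for a lightweight one."""
    total = 0
    # Parent threads only process the low-degree vertices themselves.
    for start in range(0, len(degrees), warp_size):
        warp = degrees[start:start + warp_size]
        total += max(d if d <= threshold else 0 for d in warp)
    # Each high-degree vertex pays the launch cost plus its parallel work.
    for d in degrees:
        if d > threshold:
            total += launch_overhead + -(-d // block_size)  # ceiling division
    return total


# Illustrative input: 63 low-degree vertices and one very high-degree vertex.
degrees = [4] * 63 + [1024]
flat = flat_cost(degrees)
cheap_launch = dynamic_cost(degrees, threshold=32, block_size=128, launch_overhead=10)
costly_launch = dynamic_cost(degrees, threshold=32, block_size=128, launch_overhead=2000)
```

In this made-up input the single high-degree vertex stalls its whole warp under the flat mapping, so spawning work for it pays off when the launch is cheap and backfires when the launch cost exceeds the work it saves. That is the qualitative argument for launching lightweight thread blocks rather than full kernels.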

Bibliographic record

Source: Computer Architecture News, 2015, No. 3, pp. 528-540 (13 pages)
Authors: Jin Wang; Norm Rubin; Albert Sidelnik; Sudhakar Yalamanchili
Affiliations: Georgia Institute of Technology; NVIDIA Research; NVIDIA Research; Georgia Institute of Technology
Original format: PDF
Language: English
