Source: Computer Architecture News

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs


Abstract

GPUs have proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. The recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but it still falls short of effectively harnessing the GPU's performance potential. We propose a new mechanism called Dynamic Thread Block Launch (DTBL) that extends the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, device-runtime support, and microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data-intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x speedup over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.
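The two baselines named in the abstract can be made concrete with a small CUDA sketch. It is illustrative only and not taken from the paper: the kernel names, the CSR graph layout, the neighbor-sum workload, and the degree threshold are all assumptions chosen for the example. It shows a flat mapping (one thread per vertex) next to a CUDA Dynamic Parallelism mapping in which a parent thread launches a child kernel for a high-degree vertex. DTBL itself is not shown, since the abstract gives no programming interface for it.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Abort on any CUDA runtime error (kept minimal for the sketch).
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e_)); std::exit(1); } } while (0)

// Flat mapping: one thread per vertex walks its whole adjacency list serially,
// so a high-degree vertex keeps the other lanes of its warp waiting.
__global__ void expand_flat(const int* row_ptr, const int* col_idx,
                            const int* val, int* out, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int sum = 0;
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) sum += val[col_idx[e]];
    out[v] = sum;
}

// Child kernel (hypothetical name): one thread per edge of a single vertex.
__global__ void expand_child(const int* col_idx, const int* val, int* out_v,
                             int begin, int end) {
    int e = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (e < end) atomicAdd(out_v, val[col_idx[e]]);
}

// Parent kernel: vertices at or above the (assumed) degree threshold launch a
// child kernel from the device; low-degree vertices are handled inline.
__global__ void expand_cdp(const int* row_ptr, const int* col_idx,
                           const int* val, int* out, int n, int threshold) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int begin = row_ptr[v], end = row_ptr[v + 1], deg = end - begin;
    if (deg >= threshold) {
        int blocks = (deg + 127) / 128;
        expand_child<<<blocks, 128>>>(col_idx, val, &out[v], begin, end);
    } else {
        int sum = 0;
        for (int e = begin; e < end; ++e) sum += val[col_idx[e]];
        out[v] = sum;
    }
}

// Host helper (hypothetical): runs either variant and returns per-vertex sums.
// Assumes a non-empty graph with at least one edge.
std::vector<int> expand_on_gpu(const std::vector<int>& row_ptr,
                               const std::vector<int>& col_idx,
                               const std::vector<int>& val,
                               bool use_cdp, int threshold = 32) {
    int n = static_cast<int>(row_ptr.size()) - 1;
    int *d_row, *d_col, *d_val, *d_out;
    CHECK(cudaMalloc(&d_row, row_ptr.size() * sizeof(int)));
    CHECK(cudaMalloc(&d_col, col_idx.size() * sizeof(int)));
    CHECK(cudaMalloc(&d_val, val.size() * sizeof(int)));
    CHECK(cudaMalloc(&d_out, n * sizeof(int)));
    CHECK(cudaMemcpy(d_row, row_ptr.data(), row_ptr.size() * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_col, col_idx.data(), col_idx.size() * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_val, val.data(), val.size() * sizeof(int), cudaMemcpyHostToDevice));
    CHECK(cudaMemset(d_out, 0, n * sizeof(int)));  // child kernels accumulate into zeros

    int blocks = (n + 127) / 128;
    if (use_cdp) expand_cdp<<<blocks, 128>>>(d_row, d_col, d_val, d_out, n, threshold);
    else         expand_flat<<<blocks, 128>>>(d_row, d_col, d_val, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // returns only after all child grids finish

    std::vector<int> out(n);
    CHECK(cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost));
    cudaFree(d_row); cudaFree(d_col); cudaFree(d_val); cudaFree(d_out);
    return out;
}
```

Device-side launches need relocatable device code and the device runtime library, for example `nvcc -rdc=true example.cu -lcudadevrt`. Calling `expand_on_gpu(row_ptr, col_idx, val, true)` and `expand_on_gpu(row_ptr, col_idx, val, false)` on the same graph should return identical sums. The difference lies in how the work is scheduled: the device-side variant pays a full kernel launch for each high-degree vertex, and that per-launch granularity is what the abstract proposes to replace with lightweight thread blocks.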
