GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs

Abstract

Recently, GPUs have emerged as an important parallel platform for general-purpose applications, in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages such as CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel GPU code through complex compile-time techniques. However, this approach can only parallelize loops that are 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work we propose GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory, while iterations that encountered mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: the loop-chopping parallelization approach, the deferred memory update scheme, and the intra-warp value forwarding method. (2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order but still guarantees sequential semantics. Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5x to 160x for sequential programs with possibly-parallel loops. © 2013 IEEE.
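The speculate, check, commit cycle described in the abstract can be illustrated with a small sequential simulation. This is only a sketch written for this listing, not the authors' runtime: it runs on the CPU in Python, and the names `SpecView`, `speculative_loop`, `make_body` and the parameter `chunk` are hypothetical. It models loop chopping, deferred (buffered) memory updates and a read-after-write dependency check. As a simplifying assumption, it commits the iterations that precede the first mis-speculated one and re-executes the rest in the next sub-loop. Intra-warp value forwarding, the hybrid eager/lazy checking and the parallel commit scheme are not modeled.

```python
from typing import Callable, Dict, List, Set


class SpecView:
    """Per-iteration view of memory: reads are logged, writes are buffered."""

    def __init__(self, master: List[int]) -> None:
        self.master = master
        self.read_set: Set[int] = set()
        self.write_buf: Dict[int, int] = {}

    def load(self, addr: int) -> int:
        # An iteration sees its own buffered writes first.
        if addr in self.write_buf:
            return self.write_buf[addr]
        self.read_set.add(addr)
        return self.master[addr]

    def store(self, addr: int, value: int) -> None:
        # Deferred update: master memory is untouched until commit.
        self.write_buf[addr] = value


def speculative_loop(master: List[int], n_iters: int,
                     body: Callable[[int, SpecView], None],
                     chunk: int = 4) -> int:
    """Run body(i, view) for i in range(n_iters) with sequential semantics.

    Returns how many iteration executions were discarded and redone.
    """
    start = 0
    redone = 0
    while start < n_iters:
        end = min(start + chunk, n_iters)

        # Speculation: every iteration of the sub-loop runs against the same
        # master memory (on a GPU, each would be a thread of one kernel).
        views = []
        for i in range(start, end):
            view = SpecView(master)
            body(i, view)
            views.append(view)

        # Dependency check: an iteration is mis-speculated if it read an
        # address that an earlier iteration of this sub-loop wrote.
        written: Set[int] = set()
        good = 0
        for view in views:
            if view.read_set & written:
                break
            written.update(view.write_buf)
            good += 1

        # Commit: copy buffered writes of the safe prefix in iteration order.
        for view in views[:good]:
            for addr, value in view.write_buf.items():
                master[addr] = value

        # The first iteration of a sub-loop can never fail the check,
        # so good >= 1 and the loop always makes progress.
        redone += len(views) - good
        start += good
    return redone


def make_body(idx: List[int]) -> Callable[[int, SpecView], None]:
    """Loop body a[i] = a[idx[i]] + 1; dependencies are only known at run time."""
    def body(i: int, mem: SpecView) -> None:
        mem.store(i, mem.load(idx[i]) + 1)
    return body


if __name__ == "__main__":
    a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    idx = [0, 1, 2, 2, 4, 5, 6, 7, 7, 9]  # iterations 3 and 8 read a neighbour
    redone = speculative_loop(a, len(a), make_body(idx), chunk=4)
    print(a, "re-executed iterations:", redone)
```

In this sketch a smaller `chunk` means fewer iterations are discarded when a dependency does occur, at the cost of less parallel work per sub-loop, which is the trade-off that loop chopping exposes.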

Bibliographic record

  • Authors

    Han G; Wang CL; Zhang C;

  • Affiliation
  • Year 2013
  • Total pages
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification
