IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs


Abstract

GPUs have recently emerged as an important parallel platform for general-purpose applications in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs remains difficult even with the recent introduction of high-level languages such as CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel GPU code through complex compile-time techniques. However, this approach can only parallelize loops that are completely free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism that cannot be proven by static analysis, we propose GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel under the speculation that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculation are copied to the master memory, while iterations that encountered mis-speculation are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: loop chopping, a deferred memory update scheme, and an intra-warp value forwarding method. (2) The higher overhead of dependency checking is reduced by a hybrid scheme that combines eager intra-warp dependency checking with lazy inter-warp dependency checking. (3) The serial-commit bottleneck is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics. Extensive evaluations using both micro-benchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization with GPU-TLS achieves speedups ranging from 5x to 160x for sequential programs with possibly-parallel loops.
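The chop, speculate, check, commit, and re-execute cycle described in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the GPU-TLS implementation: it runs on the CPU in plain Python, it does not model warps, intra-warp value forwarding, or the parallel commit scheme, and the names `run_speculative`, `body`, `rd`, `wr`, and `chunk` are hypothetical. As a further simplification, when an iteration mis-speculates, the sketch re-executes it directly against master memory and then re-speculates the iterations that follow it.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical loop-body signature: body(i, rd, wr), where rd(addr) reads
# a memory cell and wr(addr, value) writes one.
ReadFn = Callable[[int], int]
WriteFn = Callable[[int, int], None]
Body = Callable[[int, ReadFn, WriteFn], None]


def run_speculative(body: Body, n_iters: int, memory: List[int], chunk: int) -> int:
    """Simulate speculative execution of a loop, one sub-loop at a time.

    Returns the number of iterations that had to be re-executed.
    """
    reexecuted = 0
    start = 0
    while start < n_iters:
        end = min(start + chunk, n_iters)

        # Speculative phase: every iteration of the sub-loop reads the same
        # master memory and buffers its writes privately (deferred update).
        logs: List[Tuple[int, Set[int], Dict[int, int]]] = []
        for i in range(start, end):
            reads: Set[int] = set()
            writes: Dict[int, int] = {}

            def rd(addr: int, reads=reads, writes=writes) -> int:
                if addr in writes:          # an iteration sees its own writes
                    return writes[addr]
                reads.add(addr)             # record reads served by master memory
                return memory[addr]

            def wr(addr: int, value: int, writes=writes) -> None:
                writes[addr] = value

            body(i, rd, wr)
            logs.append((i, reads, writes))

        # Check-and-commit phase, in iteration order: an iteration has
        # mis-speculated if it read a cell written by an earlier iteration
        # of the same sub-loop (a read-after-write dependency).
        written_earlier: Set[int] = set()
        failed = None
        for i, reads, writes in logs:
            if reads & written_earlier:
                failed = i
                break
            for addr, value in writes.items():
                memory[addr] = value        # copy buffered writes to master memory
            written_earlier.update(writes)

        if failed is None:
            start = end
        else:
            # Re-execute the offending iteration non-speculatively, then
            # resume speculation with the iteration after it.
            body(failed, lambda a: memory[a],
                 lambda a, v: memory.__setitem__(a, v))
            reexecuted += 1
            start = failed + 1
    return reexecuted
```

In this sketch, a smaller `chunk` means fewer iterations speculate against the same memory state, so a dependency between two distant iterations may never be observed as a mis-speculation. That is one way to read the abstract's claim that loop chopping lowers the mis-speculation rate, at the cost of less parallelism per sub-loop.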
