Conference: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs



Abstract

GPUs have recently become an important parallel platform for general-purpose applications in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs remains difficult, even with high-level languages such as CUDA and OpenCL. To ease the programming effort, some research has proposed automatically generating parallel GPU code with complex compile-time techniques. However, that approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism that cannot be proven by static analysis, we propose GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs. GPU-TLS chops a possibly-parallel loop into smaller sub-loops, each of which is executed in parallel by a GPU kernel under the speculation that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations that did not mis-speculate are copied to master memory, while iterations that mis-speculated are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three techniques: loop chopping, a deferred memory update scheme, and intra-warp value forwarding. (2) The higher overhead of dependency checking is reduced by a hybrid scheme that combines eager intra-warp dependency checking with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics.
Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization with GPU-TLS can achieve speedups ranging from 5x to 160x for sequential programs with possibly-parallel loops.
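
The chop, speculate, check, and commit cycle described in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the GPU-TLS implementation: the names `run_iteration`, `speculative_loop`, `example_body`, `SRC`, and `DST` are hypothetical, iterations of a sub-loop run one after another instead of in GPU threads, the commit is serial and in order, and intra-warp value forwarding and the hybrid eager/lazy checking are not modeled. It shows only deferred (buffered) writes, read-set-based dependency checking, and re-execution of mis-speculated iterations.

```python
def run_iteration(body, i, memory):
    """Run iteration i against `memory` without modifying it.

    Returns the set of addresses read from memory and a dict of buffered writes.
    """
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:            # an iteration sees its own buffered writes
            return writes[addr]
        reads.add(addr)
        return memory[addr]

    def write(addr, value):
        writes[addr] = value          # deferred update: buffered, not applied yet

    body(i, read, write)
    return reads, writes


def speculative_loop(body, memory, n_iters, chunk_size):
    """Execute iterations 0..n_iters-1 with sequential semantics.

    Returns the number of iterations that had to be re-executed.
    """
    reexecuted = 0
    for start in range(0, n_iters, chunk_size):      # loop chopping into sub-loops
        stop = min(start + chunk_size, n_iters)
        # Speculative phase: every iteration of the sub-loop reads the same
        # unmodified memory (a GPU kernel would run these in parallel).
        logs = [run_iteration(body, i, memory) for i in range(start, stop)]

        committed = set()             # addresses written by earlier iterations
        for i, (reads, writes) in zip(range(start, stop), logs):
            if reads & committed:
                # Mis-speculation: the iteration read a stale value, so it is
                # re-executed against the up-to-date memory.
                reads, writes = run_iteration(body, i, memory)
                reexecuted += 1
            for addr, value in writes.items():       # in-order commit
                memory[addr] = value
            committed.update(writes)
    return reexecuted


# Hypothetical loop body: mem[DST[i]] = mem[SRC[i]] + 100.
# Iteration 3 writes address 3 and iteration 4 reads it (a true dependency).
SRC = [0, 1, 2, 3, 3, 5, 6, 7]
DST = [8, 9, 10, 3, 11, 12, 13, 14]


def example_body(i, read, write):
    write(DST[i], read(SRC[i]) + 100)
```

In this toy example, calling `speculative_loop(example_body, list(range(16)), 8, 8)` re-executes iteration 4, because the dependency falls inside a single sub-loop. With `chunk_size=4` the same dependency crosses a sub-loop boundary and nothing is re-executed, which is one way to see why chopping the loop can lower the mis-speculation rate.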
