GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs

Abstract

GPUs have recently emerged as an important parallel platform for general-purpose applications in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs is difficult, even with the recent introduction of high-level languages such as CUDA and OpenCL. To ease the programming effort, some research has proposed generating parallel GPU code automatically using complex compile-time techniques. However, this approach can only parallelize loops that are provably free of inter-iteration dependencies (i.e., DOALL loops).

To exploit runtime parallelism that cannot be proven by static analysis, this work proposes GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel under the speculation that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculation are copied to the master memory, while iterations that encountered mis-speculation are re-executed.

GPU-TLS addresses several key problems of speculative loop parallelization on GPUs:

(1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: the loop-chopping parallelization approach, the deferred memory update scheme, and the intra-warp value forwarding method.
(2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking.
(3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics.

Extensive evaluations using both micro-benchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops.
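The chop, speculate, check, and commit cycle described in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the paper's implementation: it runs on the CPU in plain Python, the names `speculative_loop`, `body`, `read`, `write`, and `chunk` are hypothetical, and it does not model intra-warp value forwarding, the eager/lazy hybrid check, or parallel commit. As a further simplification, it restarts from the first mis-speculated iteration instead of re-executing only the offending iterations.

```python
def speculative_loop(body, n_iters, memory, chunk=4):
    """Run body(i, read, write) for i in range(n_iters), preserving
    sequential semantics. Returns the number of sub-loop rounds used.

    Hypothetical sketch: each round stands in for one GPU kernel launch.
    """
    start = 0
    rounds = 0
    while start < n_iters:
        end = min(start + chunk, n_iters)
        logs = []

        # Speculative phase: every iteration of the sub-loop sees the
        # memory state from before the sub-loop started and buffers its
        # writes privately (deferred memory update).
        for i in range(start, end):
            reads, writes = set(), {}

            def read(addr, reads=reads, writes=writes):
                if addr in writes:          # read own buffered write
                    return writes[addr]
                reads.add(addr)             # record read for checking
                return memory[addr]

            def write(addr, value, writes=writes):
                writes[addr] = value        # buffer, do not touch memory

            body(i, read, write)
            logs.append((reads, writes))

        # Check and commit phase: walk iterations in order. An iteration
        # mis-speculated if it read an address that an earlier iteration
        # of the same sub-loop wrote.
        written = set()
        committed = 0
        for reads, writes in logs:
            if reads & written:
                break                       # re-execute from here next round
            for addr, value in writes.items():
                memory[addr] = value        # copy buffered writes to memory
            written |= writes.keys()
            committed += 1

        # The first iteration of a sub-loop always commits, so the loop
        # makes progress in every round.
        start += committed
        rounds += 1
    return rounds


def prefix_body(i, read, write):
    # Loop-carried dependency: iteration i reads what iteration i-1 wrote.
    if i > 0:
        write(i, read(i - 1) + read(i))


data = [1] * 6
speculative_loop(prefix_body, len(data), data)
print(data)
```

In this sketch a dependency-free loop finishes in `ceil(n_iters / chunk)` rounds, while a fully dependent loop such as `prefix_body` degrades toward one committed iteration per round, which shows why the mis-speculation rate matters so much for the achievable speedup.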

Bibliographic details

Source: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013 | pp. 120-127 | 8 pages
Conference location: Delft (NL)
Authors: Zhang Chenggang; Han Guodong; Wang Cho-Li
Original format: PDF
Keywords: GPGPU; GPU-TLS; Speculative Loop Parallelization; Thread-Level Speculation (TLS)

Similar literature

1. New Data Structures to Handle Speculative Parallelization at Runtime [J]. Alvaro Estebanez, Diego R. Llanos, Arturo Gonzalez-Escribano. International Journal of Parallel Programming. 2016, no. 3
2. Optimizing Software Runtime Systems for Speculative Parallelization [J]. Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown. ACM Transactions on Architecture and Code Optimization. 2012, no. 4
3. The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization [J]. Rauchwerger L., Padua D.A. IEEE Transactions on Parallel and Distributed Systems. 1999, no. 2
4. GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs [C]. Zhang Chenggang, Han Guodong, Wang Cho-Li. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 2013
5. Efficient GPU Parallelization of the Agent-Based Models Using MASS CUDA Library [D]. Kosiachenko, Elizaveta. 2018
6. PANET: A GPU-Based Tool for Fast Parallel Analysis of Robustness Dynamics and Feed-Forward/Feedback Loop Structures in Large-Scale Biological Networks [O]. Hung-Cuong Trinh, Duc-Hau Le, Yung-Keun Kwon
7. GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs [O]. Han G, Wang CL, Zhang C. 2013
