GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs

Abstract

GPUs have recently become an important parallel platform for general-purpose applications in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs is difficult even with high-level languages such as CUDA and OpenCL. To ease the programming effort, some research has proposed generating parallel GPU code automatically through complex compile-time techniques. However, this approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism that cannot be proven by static analysis, we propose GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel under the speculation that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculation are copied to master memory, while iterations that encountered mis-speculation are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: loop chopping, a deferred memory update scheme, and intra-warp value forwarding. (2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics. Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization with GPU-TLS can achieve speedups ranging from 5x to 160x for sequential programs with possibly-parallel loops.
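
The chop, speculate, check, commit, and re-execute cycle described in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the GPU-TLS implementation: the loop body `a[wr[i]] = a[rd[i]] + 1`, the names `run_sequential` and `run_speculative`, and the `chunk` parameter are all hypothetical. It also uses a simplified policy that commits only the iterations before the first mis-speculated one and restarts the next sub-loop there. It does not model warps, value forwarding, or the parallel commit scheme.

```python
def run_sequential(a, rd, wr):
    """Reference semantics: a possibly-parallel loop with indirect indexing."""
    a = list(a)
    for i in range(len(rd)):
        a[wr[i]] = a[rd[i]] + 1
    return a


def run_speculative(a, rd, wr, chunk=4):
    """Simulate speculative execution of the same loop in sub-loops.

    Returns the final memory and the number of squashed (re-executed)
    iteration attempts. All names here are illustrative assumptions.
    """
    master = list(a)          # master memory, updated only at commit time
    n = len(rd)
    start = 0
    squashed = 0
    while start < n:
        end = min(start + chunk, n)

        # Speculative phase: each iteration reads master memory as it was
        # at the start of the sub-loop and buffers its write (deferred update).
        buffered = {i: (wr[i], master[rd[i]] + 1) for i in range(start, end)}

        # Dependency check: an iteration mis-speculated if it read a location
        # that an earlier iteration of the same sub-loop wrote.
        written = set()
        first_bad = end
        for i in range(start, end):
            if rd[i] in written:
                first_bad = i
                break
            written.add(wr[i])

        # Commit phase: copy buffered writes of the safe prefix to master
        # memory in iteration order, which preserves sequential semantics.
        for i in range(start, first_bad):
            loc, val = buffered[i]
            master[loc] = val

        # Everything from the first mis-speculated iteration on is re-executed
        # in the next sub-loop. The first iteration of a sub-loop is always
        # safe, so the loop makes progress on every pass.
        squashed += end - first_bad
        start = first_bad
    return master, squashed
```

Under these assumptions, a loop whose iterations touch disjoint locations commits each sub-loop in a single pass, while a loop with a chain of read-after-write dependencies degrades toward one committed iteration per pass. That cost is the motivation for mechanisms such as value forwarding that lower the mis-speculation rate.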

Bibliographic details

Source
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2013 | 8 pages
Authors
Chenggang Zhang; Guodong Han; Cho-Li Wang;
Original format: PDF
Chinese Library Classification: TP301-53
Keywords
GPGPU; Speculative Loop Parallelization; Thread-Level Speculation (TLS); GPU-TLS;
