24th ACM International Conference on Supercomputing (ICS 2010)

Streamlining GPU Applications On the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping


Abstract

Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphics processing units) have quickly emerged as an influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is sensitive to the presence of conditional statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradation, impairing the adoption of GPUs for many applications that contain non-trivial branches or certain types of loops.

This paper presents a systematic investigation into the employment of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably the conflict between the large remapping overhead and the need for the remapping to happen on the fly because thread divergences depend on runtime values. It offers a solution to that challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. Finally, it reports significant performance improvements produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.
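
To make the core idea concrete, below is a minimal CUDA sketch (not the paper's implementation) of the divergence problem and of reference redirection: rather than each thread processing the element with its own index, it reads its element through a remapping array built so that threads in the same warp take the same branch. The kernel names, the remap[] array, and the naive host-side grouping are illustrative assumptions; the paper's LAM algorithm and CPU-GPU pipelining, which hide the remapping cost, are not shown.

#include <vector>
#include <cuda_runtime.h>

// Divergent baseline: within a warp, threads whose data fall on different
// sides of the branch execute the two paths serially.
__global__ void divergentKernel(const float *data, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float v = data[tid];
    if (v > 0.0f)              // branch outcome depends on a runtime value
        out[tid] = v * 2.0f;   // path A
    else
        out[tid] = -v;         // path B
}

// Reference redirection: thread tid works on element remap[tid]; the host
// builds remap[] so that consecutive threads (one warp) see elements that
// take the same path, removing intra-warp divergence.
__global__ void remappedKernel(const float *data, const int *remap,
                               float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int idx = remap[tid];
    float v = data[idx];
    if (v > 0.0f)
        out[idx] = v * 2.0f;
    else
        out[idx] = -v;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> h_data(n);
    for (int i = 0; i < n; ++i)
        h_data[i] = (i % 3 == 0) ? -1.0f : 1.0f;   // runtime values drive the branch

    // Label elements by branch outcome and group indices per path
    // (a naive CPU stand-in for a label-assign-move style reordering).
    std::vector<int> h_remap;
    h_remap.reserve(n);
    for (int i = 0; i < n; ++i) if (h_data[i] >  0.0f) h_remap.push_back(i);
    for (int i = 0; i < n; ++i) if (h_data[i] <= 0.0f) h_remap.push_back(i);

    float *d_data, *d_out; int *d_remap;
    cudaMalloc(&d_data,  n * sizeof(float));
    cudaMalloc(&d_out,   n * sizeof(float));
    cudaMalloc(&d_remap, n * sizeof(int));
    cudaMemcpy(d_data,  h_data.data(),  n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_remap, h_remap.data(), n * sizeof(int),   cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    remappedKernel<<<blocks, threads>>>(d_data, d_remap, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_data); cudaFree(d_out); cudaFree(d_remap);
    return 0;
}

Note that redirection alone scatters memory accesses (data[idx] is no longer contiguous across a warp), which is where the second technique named in the abstract, data layout transformation, comes in: physically reordering the data restores coalesced access at the cost of a copy, and the proposed CPU-GPU pipelining is what keeps that cost off the critical path.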
