Algorithm Flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking


Abstract

Graphics processing units (GPUs) have inadvertently become supercomputers in and of themselves, to the benefit of applications outside of graphics. Acceleration of multiple orders of magnitude has been achieved in scientific computing, co-processing and the like. However, the Single Instruction Multiple Data (SIMD) design of GPUs is extremely sensitive to thread divergence, so much so that for many applications the performance improvement from GPUs is all but eviscerated by it. This problem has driven general purpose GPU computing in the direction of finding “appropriate” applications to accelerate, rather than accelerating applications with a need for performance improvements. Thread divergence is generally caused by branches. Previous research has addressed the issue of reducing branches, but none of this work aims to entirely eliminate branches, because the methods required for complete branch elimination are a drastic de-optimization for CPU. We present Algorithm Flattening (AF), a de-optimization for CPU that completely removes all branches and results in a significant optimization for GPU accelerated applications. AF eliminates thread divergence, substantially decreases execution time, allows for the implementation on GPU of algorithms that previously could not fully utilize GPU capability, and generates deterministic performance. AF removes branches, replacing them with a reduced equation, and achieves a substantial speedup of already GPU accelerated algorithms and applications. We believe that AF will have a significant impact on high performance computing as it is a long-needed solution that allows unprecedented use of GPUs for general purpose applications.
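
The abstract does not describe how AF derives its reduced equation, so the following is only a minimal sketch of the general idea of trading a branch for arithmetic, not the authors' method. The function names (`clamp_branchy`, `clamp_flat`) and the clamping task are hypothetical, chosen for illustration, and Python is used only to keep the example short; on a GPU the same expression would be evaluated identically by every thread in a warp.

```python
def clamp_branchy(x, lo, hi):
    """Reference version: control flow depends on the data (would diverge on a GPU)."""
    if x < lo:
        return lo
    elif x > hi:
        return hi
    return x


def clamp_flat(x, lo, hi):
    """Branch-free version: every condition becomes a 0/1 factor in one equation.

    Assumes lo <= hi and finite numeric inputs, so exactly one factor is 1.
    """
    below = int(x < lo)          # 1 when x is under the range, else 0
    above = int(x > hi)          # 1 when x is over the range, else 0
    inside = 1 - below - above   # 1 only when x is already within [lo, hi]
    # All three products are always computed; the 0/1 factors select the result.
    return below * lo + above * hi + inside * x
```

Calling `clamp_flat(12, 0, 9)` evaluates the same arithmetic as `clamp_flat(4, 0, 9)`; only the 0/1 factors differ. The flat version does more work per call than the branchy one, which is in line with the abstract's point that complete branch elimination is a de-optimization for CPU.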


Bibliographic record

Source
IEEE Conference on High Performance Extreme Computing | 2015 | pp. 1-6 | 6 pages
Authors
Vespa Lucas; Bauman Alexander; Wells Jenny
Original format: PDF
Keywords
general purpose computers; graphics processing units; optimisation; parallel processing; CPU; SIMD design; algorithm flattening; complete branch elimination; drastic de-optimization; general purpose GPU computing; graphics processing units; single instruction multiple data; supercomputers; Acceleration; Graphics processing units; Instruction sets; Kernel; Mathematical model; Optimization;

Indexed on 2022-08-26 15:00:58

Similar literature

1. PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs [J]. L.Yu. Barash, L.N. Shchur. Computer Physics Communications. 2014, No. 4
2. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs [J]. Jan Fostier. BMC Bioinformatics. 2020, No. 2
3. A comparison-free sorting algorithm on CPUs and GPUs [J]. Abdel-hafeez Saleh, Gordon-Ross Ann, Abubaker Samer. Journal of Supercomputing. 2018, No. 11
4. Algorithm Flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking [C]. Vespa Lucas, Bauman Alexander, Wells Jenny. IEEE Conference on High Performance Extreme Computing. 2015
5. Efficient Viewshed Computation Algorithms on GPUs and CPUs [D]. Qarah, Faisal F. 2020
6. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs [O]. Jan Fostier. 2020
7. PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs [O]. Barash, L. Yu., Shchur, L. N. 2014
