Algorithmic strategies for optimizing the parallel reduction primitive in CUDA


Abstract

Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. The study of novel algorithmic strategies to optimize their implementation on current devices is therefore of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Our first contribution is therefore an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), improving its throughput further. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
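To make the terms in the abstract concrete, here is a minimal CPU-side sketch in Python of the three ideas it names: a tree-based reduction, a sequential-based reduction, and a segmented reduction. It is not the authors' CUDA implementation. The function names, the `num_workers` parameter, and the choice of contiguous chunks per worker are assumptions made for illustration only; the loops merely model what parallel threads would do.

```python
from operator import add


def tree_reduce(values, op=add):
    """Tree-based reduction: combine pairs at a doubling stride.

    Each pass of the inner loop models one parallel step in which
    every active thread combines two elements.
    """
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    stride = 1
    while stride < len(data):
        for i in range(0, len(data) - stride, 2 * stride):
            # Left operand stays on the left, so only associativity is needed.
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]


def sequential_reduce(values, num_workers=4, op=add):
    """Sequential-based reduction (illustrative assumption: contiguous chunks).

    Each hypothetical worker reduces its own chunk with a plain loop;
    the few partial results are then combined with tree_reduce.
    """
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    chunk = -(-len(data) // num_workers)  # ceiling division
    partials = []
    for start in range(0, len(data), chunk):
        acc = data[start]
        for x in data[start + 1:start + chunk]:
            acc = op(acc, x)
        partials.append(acc)
    return tree_reduce(partials, op)


def segmented_reduce(values, segment_lengths, op=add, reducer=tree_reduce):
    """Segmented reduction: reduce each independent segment separately."""
    data = list(values)
    results = []
    start = 0
    for length in segment_lengths:
        results.append(reducer(data[start:start + length], op=op))
        start += length
    return results
```

For example, `segmented_reduce([1, 2, 3, 4, 5, 6], [1, 2, 3])` reduces the segments `[1]`, `[2, 3]`, and `[4, 5, 6]` independently. Passing `reducer=sequential_reduce` swaps the per-segment strategy, which is the kind of choice the abstract reports as depending on segment size.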


Bibliographic details

Source: 2012 International Conference on High Performance Computing & Simulation, 2012, pp. 511-519 (9 pages)
Conference location: Madrid (ES)
Authors: Martin Pedro J.; Ayuso Luis F.; Torres Roberto; Gavilanes Antonio
Affiliation: Departamento de Sistemas Informáticos y Computación, Universidad Complutense de Madrid, Madrid, Spain
Original format: PDF
Language: English
Chinese Library Classification: General issues; automatic simulation theory

