首页> 外文会议>2012 International Conference on High Performance Computing amp; Simulation >Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
【24h】

Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

机译:在CUDA中优化并行约简原语的算法策略

获取原文
获取原文并翻译 | 示例

摘要

Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known dataparallel primitives. Those primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of novel algorithmic strategies to optimize their implementation on current devices is an interesting topic to the GPU community. In this paper we focus on optimizing the reduction primitive, which merely reduces a data sequence into a single value using a binary associative operator. Although tree-based and sequential-based algorithms have been already implemented on GPUs, a comparison of both algorithm performance had not been carried out yet. Thus, our first contribution is to present an experimental study of state-of-the-art reduction algorithms on CUDA. Next we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), improving its throughput even more. Finally, we replicate this methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm exhibits the best performance, since throughput deeply depends on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
机译:许多通用应用程序通过执行一组众所周知的数据并行原语来利用图形处理单元(GPU)。这些原语通常会被主机多次调用,因此它们的吞吐量对整个系统的性能有很大的影响。因此,对新型算法策略进行研究以优化其在当前设备上的实现是GPU社区感兴趣的话题。在本文中,我们专注于优化归约原语,该原语仅使用二进制关联运算符将数据序列归为单个值。尽管已经在GPU上实现了基于树和基于序列的算法,但是尚未对两种算法的性能进行比较。因此,我们的第一个贡献是对CUDA上最先进的约简算法进行实验研究。接下来,我们介绍两种算法优化,它们被集成到最快的解决方案(基于序列的算法)中,从而进一步提高了吞吐量。最后,我们将此方法复制到图元的分段版本,当输入由几个独立的段组成时,将应用该方法。在这种情况下,尚不清楚哪种算法表现出最佳性能,因为吞吐量在很大程度上取决于段沿输入的分布。根据我们的结果,基于树的算法在小片段上的运行速度更快,而顺序算法对于大中片的算法则更好。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号