首页> 外文期刊>Concurrency and computation: practice and experience >CUDA-quicksort: an improved GPU-based implementation of quicksort
【24h】

CUDA-quicksort: an improved GPU-based implementation of quicksort

机译:CUDA-quicksort:改进的基于GPU的quicksort实现

获取原文
获取原文并翻译 | 示例

摘要

Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize some sorting algorithms. Two GPU-based implementations of the quicksort were presented in literature: the GPU-quicksort, a compute-unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA-quicksort an iterative GPU-based implementation of the sorting algorithm. CUDA-quicksort has been designed starting from GPU-quicksort. Unlike GPU-quicksort, it uses atomic primitives to perform inter-block communications while ensuring an optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA-quicksort is up to four times faster than GPU-quicksort and up to three times faster than CDP-quicksort. An in-depth analysis of the performance between CUDA-quicksort and GPU-quicksort shows that the main improvement is related to the optimized GPU memory access rather than to the use of atomic primitives. Moreover, in order to assess the advantages of using the CUDA dynamic parallelism, we implemented a recursive version of the CUDA-quicksort. Experimental results show that CUDA-quicksort is faster than the CDP-quicksort provided by NVIDIA, with better performance achieved using the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.
机译:排序是计算机科学中非常重要的任务,对于大量使用排序算法的程序而言,排序成为一项至关重要的操作。通用计算已成功用于图形处理单元(GPU)上,以并行化某些排序算法。文献中介绍了两种基于GPU的快速排序实现:GPU快速排序(一种计算统一的设备体系结构(CUDA)迭代实现)和CUDA动态并行(CDP)快速排序(一种由NVIDIA Corporation提供的递归实现)。我们提出CUDA-quicksort,它是基于迭代GPU的排序算法实现。 CUDA-quicksort是从GPU-quicksort开始设计的。与GPU快速排序不同,它使用原子原语执行块间通信,同时确保对GPU内存的优化访问。对六个排序基准分布进行的实验表明,CUDA-quicksort比GPU-quicksort快四倍,比CDP-quicksort快三倍。对CUDA-quicksort和GPU-quicksort之间的性能进行的深入分析表明,主要的改进与优化的GPU内存访问有关,而不是与原子基元的使用有关。此外,为了评估使用CUDA动态并行性的优势,我们实现了CUDA-quicksort的递归版本。实验结果表明,CUDA-quicksort比NVIDIA提供的CDP-quicksort更快,并且使用迭代实现可实现更好的性能。版权所有©2015 John Wiley&Sons,Ltd.

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号