Conference: IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems

Performance Evaluation of Priority Queues for Fine-Grained Parallel Tasks on GPUs

Abstract

Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues, which guide the progress of the computations and can easily become the bottleneck of an application. To our knowledge, no systematic comparison of priority queue implementations on GPUs currently exists in the literature. We close this gap with a performance evaluation of GPU-based priority queue implementations for two applications: discrete-event simulation and parallel A* path searches on grids. We focus on scenarios requiring large numbers of priority queues holding up to a few thousand items each. We present performance measurements covering linear queue designs, implicit binary heaps, splay trees, and a GPU-specific proposal from the literature. The measurement results show that, up to about 500 items per queue, circular buffers frequently outperform tree-based queues for the considered applications, particularly under a simple parallelization of individual item enqueue operations. We analyze profiling metrics to explore classical queue designs in light of the importance of high hardware utilization as well as homogeneous computations and memory accesses across GPU threads.
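To make the "linear queue design" mentioned in the abstract concrete, the following is a minimal sequential C++ sketch of a priority queue kept as a sorted circular buffer. It is an illustration only, not the implementation evaluated in the paper: the class name `SortedRingQueue`, its methods, and the fixed-capacity layout are all assumptions made for this example, and the GPU parallelization of enqueue operations that the abstract refers to is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity priority queue stored as a sorted circular
// buffer (smallest key at the head). Names and layout are assumptions for
// illustration, not taken from the paper.
template <typename Key, std::size_t Capacity>
class SortedRingQueue {
public:
    bool empty() const { return size_ == 0; }
    std::size_t size() const { return size_; }

    // Insert a key while keeping ascending order: O(n) shifts.
    // Returns false if the queue is full.
    bool enqueue(Key key) {
        if (size_ == Capacity) return false;
        std::size_t i = size_;  // logical index of the first free slot
        // Shift every larger item one slot toward the tail.
        while (i > 0 && slot(i - 1) > key) {
            slot(i) = slot(i - 1);
            --i;
        }
        slot(i) = key;
        ++size_;
        return true;
    }

    // Remove the smallest key in O(1) by advancing the head.
    // Returns false if the queue is empty.
    bool dequeue(Key& out) {
        if (size_ == 0) return false;
        out = slot(0);
        head_ = (head_ + 1) % Capacity;
        --size_;
        return true;
    }

private:
    // Map a logical position (0 = smallest item) to a physical array slot.
    Key& slot(std::size_t logical) {
        assert(logical < Capacity);
        return items_[(head_ + logical) % Capacity];
    }

    std::array<Key, Capacity> items_{};
    std::size_t head_ = 0;  // physical index of the smallest item
    std::size_t size_ = 0;  // number of stored items
};
```

In this sketch, dequeuing the minimum only advances the head index, and enqueuing performs a regular sequence of neighbor-to-neighbor shifts. That contrasts with the pointer chasing and data-dependent branching of tree-based queues, which is one plausible reading of why the abstract emphasizes homogeneous computations and memory accesses for small queues.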
