Conference: Symposium on Application Accelerators in High Performance Computing

On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering


Abstract

General-purpose graphics processing units (GPUs) have proven to be viable solutions for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To keep the GPU from being under-utilized for small problem sizes, a meaningful approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, Concurrent Kernel Execution (CKE) allows up to 16 GPU kernels to execute concurrently on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can be a challenge to manage multiple host threads interacting with the GPU device and, in addition, to have CKE work properly. It can be observed that CKE performance breaks down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since it is common in real-world applications for multiple host threads to process their data independently, a mechanism is needed that helps avoid this CKE breakdown. We propose a producer-consumer approach that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.
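The abstract does not spell out the reordering scheme, so the following is only a minimal host-side sketch of one plausible reading of the producer-consumer idea, not the authors' implementation. Host threads (producers) enqueue deferred kernel launches instead of launching directly, and a single consumer then reorders the pending launches round-robin across host threads before issuing them. The names `LaunchRequest`, `LaunchQueue`, `interleave`, and `run_demo` are hypothetical, and the `launch` callback merely records the issue order; in a real CUDA program it would presumably wrap a kernel launch on a per-thread stream.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical deferred launch. In a CUDA program `launch` would wrap
// something like kernel<<<grid, block, 0, stream>>>(...); here it is a stub.
struct LaunchRequest {
    int host_thread;               // id of the producing host thread
    int seq;                       // position in that thread's launch sequence
    std::function<void()> launch;  // the deferred invocation
};

// Thread-safe buffer filled by the producers and emptied by the consumer.
class LaunchQueue {
public:
    void push(LaunchRequest r) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(r));
    }
    std::vector<LaunchRequest> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<LaunchRequest> out;
        out.swap(pending_);
        return out;
    }
private:
    std::mutex mutex_;
    std::vector<LaunchRequest> pending_;
};

// Assumed reordering policy: round-robin across host threads, keeping each
// thread's own launches in their original relative order.
std::vector<LaunchRequest> interleave(std::vector<LaunchRequest> batch) {
    const std::size_t total = batch.size();
    std::map<int, std::vector<LaunchRequest>> per_thread;
    for (auto& r : batch) per_thread[r.host_thread].push_back(std::move(r));

    std::vector<LaunchRequest> ordered;
    ordered.reserve(total);
    for (std::size_t round = 0; ordered.size() < total; ++round)
        for (auto& kv : per_thread)
            if (round < kv.second.size())
                ordered.push_back(std::move(kv.second[round]));
    assert(ordered.size() == total);
    return ordered;
}

// Producers enqueue in parallel; after they finish, one consumer reorders
// the batch and issues it. Returns the (host_thread, seq) issue order.
std::vector<std::pair<int, int>> run_demo(int num_threads, int kernels_per_thread) {
    LaunchQueue queue;
    std::vector<std::pair<int, int>> issued;  // written only by the consumer
    std::vector<std::thread> producers;
    for (int t = 0; t < num_threads; ++t)
        producers.emplace_back([&queue, &issued, t, kernels_per_thread] {
            for (int k = 0; k < kernels_per_thread; ++k)
                queue.push({t, k, [&issued, t, k] { issued.emplace_back(t, k); }});
        });
    for (auto& p : producers) p.join();

    for (auto& r : interleave(queue.drain())) r.launch();
    return issued;
}
```

Calling `run_demo(3, 2)` issues the first launch of every host thread before any thread's second launch, whatever order the producers happened to enqueue in. The batch-then-issue structure is a simplification chosen to keep the sketch deterministic; a consumer running alongside the producers would need a condition variable or similar signalling.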
