On Improving the Performance of Multi-threaded CUDA Applications with Concurrent Kernel Execution by Kernel Reordering

Abstract

General-purpose graphics processing units (GPUs) have been found to be viable solutions for large-scale numerical computations with an inherent potential for massive parallelism. In contrast, little is known about using GPUs for small-scale computations. To keep the GPU from being under-utilized for small problem sizes, a meaningful approach is to perform as many small-scale computations as possible concurrently. On NVIDIA Fermi GPUs, the concept of Concurrent Kernel Execution (CKE) allows up to 16 GPU kernels to execute concurrently on a single device. While using CKE in single-threaded CUDA programs is straightforward, in multi-threaded programs it can become a challenge to manage multiple host threads interacting with the GPU device and, in addition, to have CKE work properly. It can be observed that CKE performance breaks down when multiple host threads each invoke multiple GPU kernels in succession without synchronizing their actions. Since in real-world applications it is common for multiple host threads to process their data independently, a mechanism is needed that helps avoid CKE breakdown. We propose an approach based on the producer-consumer principle that manages GPU kernel invocations from within parallel host regions by reordering the respective GPU kernels before actually invoking them. We are able to demonstrate significant performance improvements with this technique in a strong-scaling simulation of a small molecule solvated within a nanodroplet.
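The producer-consumer idea in the abstract can be illustrated with a small host-side simulation. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not say which reordering policy is used, so the round-robin interleaving across host threads is an assumption, and the names `producer`, `consumer`, `reorder_round_robin` and `run_simulation` are hypothetical. No GPU is involved; a "launch" is just an entry in a list.

```python
import queue
import threading
from collections import defaultdict
from itertools import zip_longest


def producer(thread_id, kernel_names, requests):
    """Host thread: submit launch requests instead of launching directly."""
    for name in kernel_names:
        requests.put((thread_id, name))
    requests.put(None)  # sentinel: this producer has no more requests


def reorder_round_robin(pending):
    """Interleave requests across threads, keeping each thread's own order.

    Assumed policy for illustration only; the source does not specify one.
    """
    per_thread = defaultdict(list)
    for thread_id, name in pending:
        per_thread[thread_id].append((thread_id, name))
    lanes = [per_thread[t] for t in sorted(per_thread)]
    ordered = []
    for batch in zip_longest(*lanes):
        ordered.extend(item for item in batch if item is not None)
    return ordered


def consumer(requests, num_producers, launch_log):
    """Single consumer: collect all requests, reorder, then 'launch' them."""
    pending, finished = [], 0
    while finished < num_producers:
        item = requests.get()
        if item is None:
            finished += 1
        else:
            pending.append(item)
    # A real consumer would issue the kernel launches here.
    launch_log.extend(reorder_round_robin(pending))


def run_simulation(workloads):
    """workloads maps a host-thread id to its list of kernel names."""
    requests, launch_log = queue.Queue(), []
    cons = threading.Thread(
        target=consumer, args=(requests, len(workloads), launch_log)
    )
    prods = [
        threading.Thread(target=producer, args=(tid, names, requests))
        for tid, names in workloads.items()
    ]
    cons.start()
    for p in prods:
        p.start()
    for p in prods:
        p.join()
    cons.join()
    return launch_log


if __name__ == "__main__":
    for thread_id, kernel in run_simulation(
        {0: ["a0", "a1"], 1: ["b0", "b1"], 2: ["c0"]}
    ):
        print(thread_id, kernel)
```

In this sketch the consumer waits for every producer to finish before reordering, which keeps the result deterministic; a practical version would more likely reorder in batches while the host threads are still producing.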

Bibliographic Record

Source: 2012 Symposium on Application Accelerators in High Performance Computing, 2012, pp. 74-83 (10 pages)
Conference location: Argonne, IL (US)
Authors: Wende Florian; Cordes Frank; Steinke Thomas
Original format: PDF
Language of text: English
CLC classification: Computing technology, computer technology
