Scalable Kernel Fusion for Memory-Bound GPU Applications

Abstract

GPU implementations of HPC applications that rely on finite difference methods can include tens of memory-bound kernels. Kernel fusion can improve performance by reducing data traffic to off-chip memory: kernels that share data arrays are fused into larger kernels, in which on-chip cache holds the data reused by instructions originating from different kernels. The main challenges are a) searching for the optimal kernel fusions under the constraints of data dependencies and kernel precedences and b) applying kernel fusion effectively to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
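The traffic argument in the abstract can be illustrated with a rough sketch. This is not the paper's search method or its projection model. It is a toy estimate under loudly stated assumptions: every array has the same size, a fused kernel loads each distinct input array from off-chip memory at most once, an array produced inside a fused kernel is reused on chip instead of being reloaded, and every written array is still stored. Halo exchange, cache capacity, and occupancy are ignored. The `Kernel` type, the function names, and the three example kernels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical description of a kernel by the arrays it touches."""
    name: str
    reads: frozenset
    writes: frozenset


def offchip_accesses(group):
    """Count array-sized off-chip transfers for kernels fused into one group.

    Assumption: kernels are listed in execution order, each distinct array
    is loaded at most once and stored at most once per fused kernel, and an
    array written earlier in the group is served from on-chip storage.
    """
    loaded = set()
    stored = set()
    for kernel in group:
        loaded |= {a for a in kernel.reads if a not in stored}
        stored |= kernel.writes
    return len(loaded) + len(stored)


def plan_traffic(plan):
    """Total off-chip transfers for a fusion plan (a list of groups)."""
    return sum(offchip_accesses(group) for group in plan)


def speedup_upper_bound(baseline_plan, fused_plan):
    """Traffic ratio, an optimistic bound if runtime scales with traffic."""
    return plan_traffic(baseline_plan) / plan_traffic(fused_plan)


# Three made-up kernels that share arrays a, b, c, and d.
k1 = Kernel("k1", frozenset({"a", "b"}), frozenset({"c"}))
k2 = Kernel("k2", frozenset({"a", "c"}), frozenset({"d"}))
k3 = Kernel("k3", frozenset({"b", "d"}), frozenset({"e"}))

unfused = [[k1], [k2], [k3]]
partly_fused = [[k1, k2], [k3]]
fully_fused = [[k1, k2, k3]]

for label, plan in [("unfused", unfused),
                    ("k1+k2 fused", partly_fused),
                    ("all fused", fully_fused)]:
    print(label, plan_traffic(plan))

print("bound:", speedup_upper_bound(unfused, fully_fused))
```

In this toy case the unfused plan moves 9 array-sized transfers and the fully fused plan moves 5, so the optimistic bound is 1.8x. Real gains are smaller, in line with the 1.35x and 1.2x reported in the abstract, because fused kernels compete for limited on-chip storage and data dependencies restrict which kernels may legally be fused.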


Bibliographic details

Source
2014 | pp. 191-202 | 12 pages
Authors
Wahib Mohamed; Maruyama Naoya
Original format: PDF
Keywords
cache storage; finite difference methods; graphics processing units; parallel processing; performance evaluation; HPC applications; codeless performance upper-bound projection model; data arrays; data dependencies; data traffic; kernel precedences; memory-bound GPU applications; memory-bound kernels; off-chip memory; on-chip cache; optimal kernel fusions; scalable kernel fusion; Arrays; Instruction sets; Kernel; Meteorology; Optimization; System-on-chip


