
Matrix Multiplication Beyond Auto-Tuning: Rewrite-Based GPU Code Generation



Abstract

Graphics Processing Units (GPUs) are used as general-purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices.

However, producing high-performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, which has a very different architecture from desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to achieve its performance potential on another type of GPU.

Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics.

In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made. This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.
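To make the idea of "a high-level programming model plus a system of rewrite rules" concrete, the following is a minimal, self-contained Scala sketch. The expression IR, names, and the single map-fusion rule are illustrative assumptions, not the paper's actual intermediate language or rule set; they only show how a program can be rewritten in successive, semantics-preserving steps in which optimization decisions are taken.

    // Minimal sketch of rewrite-based code generation (illustrative only;
    // the IR and the rule below are assumptions, not the paper's system).
    // A tiny functional expression language plus one rewrite rule (map
    // fusion), applied bottom-up so optimization decisions become rewrites.

    sealed trait Expr
    case class Input(name: String)                          extends Expr
    case class MapOp(f: String, in: Expr)                   extends Expr // map f in
    case class ReduceOp(op: String, init: String, in: Expr) extends Expr // reduce op init in
    case class ZipOp(a: Expr, b: Expr)                      extends Expr // zip a b

    object RewriteSketch {

      // Rule: map f (map g xs)  =>  map (f . g) xs
      // Fusing consecutive maps removes an intermediate array -- one example
      // of an optimization decision made during successive rewrite steps.
      def mapFusion(e: Expr): Option[Expr] = e match {
        case MapOp(f, MapOp(g, in)) => Some(MapOp(s"($f . $g)", in))
        case _                      => None
      }

      // Apply a rule once, bottom-up, over the whole expression tree.
      def rewrite(rule: Expr => Option[Expr], e: Expr): Expr = {
        val rebuilt = e match {
          case MapOp(f, in)           => MapOp(f, rewrite(rule, in))
          case ReduceOp(op, init, in) => ReduceOp(op, init, rewrite(rule, in))
          case ZipOp(a, b)            => ZipOp(rewrite(rule, a), rewrite(rule, b))
          case leaf                   => leaf
        }
        rule(rebuilt).getOrElse(rebuilt)
      }

      def main(args: Array[String]): Unit = {
        // One dot product from matrix multiplication, written at a high level:
        //   reduce (+) 0 (map (*) (zip rowA colB))
        val dot = ReduceOp("+", "0", MapOp("*", ZipOp(Input("rowA"), Input("colB"))))

        // Two consecutive maps over an input; the fusion rule merges them.
        val pipeline = MapOp("f", MapOp("g", Input("xs")))

        println(s"dot-product IR : $dot")
        println(s"before rewrite : $pipeline")
        println(s"after rewrite  : ${rewrite(mapFusion, pipeline)}")
      }
    }

Because each rule preserves the program's meaning, a generator built this way can explore different rewrite sequences and choose the one that maps best onto a particular device, which is how the same high-level matrix multiplication can yield different low-level code for desktop and Mali GPUs.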
