Conference: Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems


Abstract

OpenCL is a vendor-neutral, portable interface for programming parallel compute devices such as GPUs. Tuning OpenCL implementations of important library functions, such as dense general matrix multiply (GEMM), for a particular device is a difficult problem. Further, OpenCL kernels tuned for one architecture perform poorly on other architectures. We present a solution to the challenge of writing a portable, high-performance GEMM implementation. We designed and implemented RaijinCL, an OpenCL auto-tuning library for real and complex variants of GEMM that automatically generates tuned kernels for a given architecture. We tested the library on a wide variety of architectures and show that it is competitive with vendor libraries on all of them. We also implemented an auto-tuner for hybrid CPU+GPU GEMM that takes advantage of both the CPU and the GPU on single-chip CPU+GPU platforms such as Intel Ivy Bridge. We show that our solution can outperform CPU-only, GPU-only, and simple CPU+GPU tuning strategies. In addition to performance results, we provide an analysis of architectural limitations and of OpenCL compiler and runtime issues discovered on various systems, along with guidance on avoiding some of these issues.
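To make the idea of auto-tuning concrete, the following is a minimal sketch of an empirical search over kernel parameters. It is not RaijinCL's code or its search strategy, which the abstract does not describe. It assumes a pure-Python blocked matrix multiply as a stand-in for a generated OpenCL kernel, and the names `blocked_gemm`, `autotune`, `tile_i`, and `tile_k` are hypothetical.

```python
import itertools
import time


def blocked_gemm(a, b, n, tile_i, tile_k):
    """Compute C = A * B for n x n lists of lists, blocking the i and k loops."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for k0 in range(0, n, tile_k):
            for i in range(i0, min(i0 + tile_i, n)):
                row_a, row_c = a[i], c[i]
                for k in range(k0, min(k0 + tile_k, n)):
                    a_ik, row_b = row_a[k], b[k]
                    for j in range(n):
                        row_c[j] += a_ik * row_b[j]
    return c


def autotune(n=48, tiles=(4, 8, 16), repeats=2):
    """Time every (tile_i, tile_k) candidate; return the fastest and all timings.

    Assumption: exhaustive search over a small hand-picked candidate set,
    keeping the best of `repeats` runs per candidate.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile_i, tile_k in itertools.product(tiles, tiles):
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_gemm(a, b, n, tile_i, tile_k)
            best = min(best, time.perf_counter() - start)
        timings[(tile_i, tile_k)] = best
    return min(timings, key=timings.get), timings
```

Calling `autotune()` returns the fastest `(tile_i, tile_k)` pair on the machine it runs on, together with the measured time for each candidate. The winning pair can differ between machines, which is the portability problem the abstract describes.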
