A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems

Abstract

OpenCL is a vendor-neutral, portable interface for programming parallel compute devices such as GPUs. Tuning OpenCL implementations of important library functions such as dense general matrix multiply (GEMM) for a particular device is a difficult problem. Further, OpenCL kernels tuned for one architecture perform poorly on other architectures. We present a solution to the challenge of writing a portable and high-performance GEMM implementation. We designed and implemented RaijinCL, an OpenCL auto-tuning library for real and complex variants of GEMM that automatically generates tuned kernels for a given architecture. We comprehensively tested our library on a wide variety of architectures and show that it is competitive with vendor libraries on all tested architectures. We also implemented an autotuner for hybrid CPU+GPU GEMM that takes advantage of both the CPU and the GPU on single-chip CPU+GPU platforms such as Intel Ivy Bridge. We show that our solution can outperform CPU-only, GPU-only, and simple CPU+GPU tuning strategies. In addition to performance results, we provide an analysis of architectural limitations and of OpenCL compiler and runtime issues discovered on various systems, along with guidance on avoiding some of these issues.
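The auto-tuning idea in the abstract can be illustrated with a small sketch. This is not RaijinCL's code or API: it is a pure-Python stand-in, assuming a single tunable parameter (a square tile size) and a brute-force search that times each candidate on a sample problem. The names `gemm_tiled` and `autotune` are hypothetical and chosen only for this illustration; a real OpenCL autotuner would generate and time device kernels instead.

```python
import time


def gemm_tiled(A, B, C, alpha, beta, tile):
    """Compute C = alpha*A*B + beta*C in place, looping over square tiles."""
    m, k, n = len(A), len(B), len(B[0])
    # Scale the existing contents of C by beta first.
    for row in C:
        for j in range(n):
            row[j] *= beta
    # Accumulate alpha*A*B one tile at a time.
    for i0 in range(0, m, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, m)):
                    Ai, Ci = A[i], C[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a = alpha * Ai[p]
                        Bp = B[p]
                        for j in range(j0, min(j0 + tile, n)):
                            Ci[j] += a * Bp[j]
    return C


def autotune(candidates, m=32, k=32, n=32, repeats=2):
    """Time each candidate tile size on a sample problem; return the fastest."""
    A = [[1.0] * k for _ in range(m)]
    B = [[1.0] * n for _ in range(k)]
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            C = [[0.0] * n for _ in range(m)]
            start = time.perf_counter()
            gemm_tiled(A, B, C, 1.0, 0.0, tile)
            # Keep the best of several runs to reduce timing noise.
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile
```

Calling `autotune([4, 8, 16])` returns whichever tile size ran fastest on the current machine, so the result can differ between devices; that device dependence is the reason the search is run per architecture rather than fixed in advance.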


Bibliographic details

Source
Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2014, pp. 672-680 (9 pages)
Authors
Garg Rahul; Hendren Laurie
Original format: PDF
Keywords
BLAS; CUDA; GEMM; GPGPU; Ivy Bridge; OpenCL; autotuning; heterogeneous computing


Similar literature


1. PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs [J]. L.Yu. Barash, L.N. Shchur. Computer Physics Communications, 2014, no. 4
2. DiSCaMB: a software library for aspherical atom model X-ray scattering factor calculations with CPUs and GPUs [J]. Michał L. Chodkiewicz, Szymon Migacz, Witold Rudnicki. Journal of Applied Crystallography, 2018, no. 1
3. ThunderSVM: A Fast SVM Library on GPUs and CPUs [J]. Zeyi Wen, Jiashuai Shi, Qinbin Li. Journal of Machine Learning Research, 2018, no. a
4. A Portable and High-Performance General Matrix-Multiply (GEMM) Library for GPUs and Single-Chip CPU/GPU Systems [C]. Garg Rahul, Hendren Laurie. Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2014
5. Toward Performance Portability for CPUs and GPUs through Algorithmic Compositions [D]. Chang, Li-Wen. 2017
6. DiSCaMB: a software library for aspherical atom model X-ray scattering factor calculations with CPUs and GPUs [O]. Michał L. Chodkiewicz, Szymon Migacz, Witold Rudnicki
7. GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding [O]. Zhaocheng Zhu, Shizhen Xu, Jian Tang. 2019
