Journal of Parallel and Distributed Computing

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices


Abstract

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16; the implementation can be easily extended to larger sizes. For single-precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on an NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar performance improvements are obtained for other sizes, for single- and double-precision real and complex types, and for smaller batch counts. Beyond the implementation itself, our modified function interface also plays an important role in the improved performance. Applications of this software include finite element computation on GPUs.
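To make the operation concrete, the batched GEMM described above computes C_i = alpha * A_i * B_i + beta * C_i independently for every matrix pair i in the batch. The following is a minimal NumPy sketch of these semantics only, not the paper's CUDA implementation or its interface; the function name, batch count, and matrix size are illustrative choices (n = 10 mirrors the small sizes under 16 that the paper targets).

```python
import numpy as np

def batched_gemm(alpha, A, B, beta, C):
    """Reference semantics of batched GEMM.

    A, B, C are arrays of shape (batch, n, n); each pair is multiplied
    independently: C_i <- alpha * A_i @ B_i + beta * C_i.
    """
    # np.matmul broadcasts over the leading batch dimension,
    # performing one small n x n multiply per pair.
    return alpha * np.matmul(A, B) + beta * C

# Batch of 1,000 independent 10x10 single-precision pairs
# (reduced from the paper's 100,000 for brevity).
rng = np.random.default_rng(0)
batch, n = 1000, 10
A = rng.standard_normal((batch, n, n)).astype(np.float32)
B = rng.standard_normal((batch, n, n)).astype(np.float32)
C = np.zeros((batch, n, n), dtype=np.float32)

out = batched_gemm(1.0, A, B, 0.0, C)
```

A GPU implementation differs mainly in how the many small independent multiplies are mapped to threads and how operands are laid out in memory, which is where the interface design discussed in the abstract matters.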
