Journal of Parallel and Distributed Computing

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices


Abstract

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16; the implementation can be easily extended to larger sizes. For single-precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on an NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar performance improvements are obtained for other sizes, for single- and double-precision real and complex types, and for smaller batch counts. Beyond the implementation itself, our modified function interface also plays an important role in the improved performance. Applications of this software include finite element computation on GPUs.
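To make the operation concrete, the batched GEMM described above computes C_i = alpha * A_i * B_i + beta * C_i independently for every matrix pair i in the batch. The following is a minimal NumPy sketch of these semantics only, not the paper's CUDA implementation or its interface; the function name, batch count, and matrix size are illustrative choices (n = 10 mirrors the small sizes under 16 that the paper targets).

```python
import numpy as np

def batched_gemm(alpha, A, B, beta, C):
    """Reference semantics of batched GEMM.

    A, B, C are arrays of shape (batch, n, n); each pair is multiplied
    independently: C_i <- alpha * A_i @ B_i + beta * C_i.
    """
    # np.matmul broadcasts over the leading batch dimension,
    # performing one small n x n multiply per pair.
    return alpha * np.matmul(A, B) + beta * C

# Batch of 1,000 independent 10x10 single-precision pairs
# (reduced from the paper's 100,000 for brevity).
rng = np.random.default_rng(0)
batch, n = 1000, 10
A = rng.standard_normal((batch, n, n)).astype(np.float32)
B = rng.standard_normal((batch, n, n)).astype(np.float32)
C = np.zeros((batch, n, n), dtype=np.float32)

out = batched_gemm(1.0, A, B, 0.0, C)
```

A GPU implementation differs mainly in how the many small independent multiplies are mapped to threads and how operands are laid out in memory, which is where the interface design discussed in the abstract matters.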
