International Conference on Parallel and Distributed Systems

Improving Performance of Matrix Multiplication and FFT on GPU


Abstract

In this paper we discuss our experience in improving the performance of two key algorithms implemented in CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX 280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance is obtained over a range of problem sizes. Some common principles for the design and implementation of many-core algorithms are discussed.
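The abstract contrasts computation-intensive SGEMM with bandwidth-intensive FFT. As background to the blocking technique that CUDA SGEMM implementations build on, the following is a minimal, hypothetical tiled kernel: it is not the authors' kernel, and `TILE`, `sgemm_tiled`, and the square-matrix assumption (`N` a multiple of `TILE`) are illustrative choices only.

```cuda
// Minimal sketch of a shared-memory-tiled SGEMM kernel:
// C = alpha * A * B + beta * C for N x N row-major matrices,
// assuming N is a multiple of TILE. Illustration only.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C,
                            int N, float alpha, float beta)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk the K dimension one TILE-wide slab at a time, staging each
    // slab of A and B in on-chip shared memory so that every global
    // load is reused TILE times, cutting off-chip memory traffic.
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```

A kernel like this would be launched with a `dim3(TILE, TILE)` block and an `N/TILE` by `N/TILE` grid; reaching the peak figures reported in the paper additionally requires register blocking and careful memory-access tuning beyond this sketch.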
