HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs

Kang Homin; Kwon Hyuck Chan; Kim Duksu


Abstract

We present a novel heterogeneous parallel matrix multiplication algorithm that uses both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen's method, we represent a matrix multiplication as a set of matrix addition and multiplication tasks on its sub-matrices. We then distribute the tasks to CPUs and GPUs, taking into account the characteristics of the tasks and the computing resources, to minimize data communication overhead and fully utilize the available computing power. To handle a large matrix efficiently with limited GPU memory, we also propose a block-based work decomposition method. We further improve performance by exploiting the concurrent execution abilities of a heterogeneous parallel computing system. We implemented our method on five different heterogeneous systems and applied it to matrices of various sizes. Our method generally shows higher performance than prior GPU-based matrix multiplication methods. Moreover, compared with the state-of-the-art GPU matrix multiplication library (i.e., CUBLAS), our method achieved up to 1.97 times higher performance using the same GPUs and CPU cores. In some cases, our method using a low-performance GPU (e.g., GTX 1060, 3 GB) achieved performance comparable to that of CUBLAS using a high-performance GPU (e.g., RTX 2080, 8 GB). Performance also continues to improve as more computing resources, such as additional CPU cores and GPUs, are used. We achieve this performance because our approach fully utilizes the capacity of the given heterogeneous parallel computing system while employing Strassen's method, which has a lower asymptotic complexity than standard matrix multiplication. These results demonstrate the efficiency and robustness of our algorithm.
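To make concrete the idea of representing a multiplication as addition and multiplication tasks on sub-matrices, here is a minimal sketch of one level of the textbook Strassen decomposition in plain Python. It is not the HPMaX implementation: the function names (`strassen_tasks`, `strassen_one_level`, and the helpers) are hypothetical, the matrices are assumed square with even size, and the scheduling of tasks onto CPUs and GPUs and the block-based decomposition described in the abstract are not reproduced.

```python
# Illustrative sketch only: one level of Strassen's method on square
# matrices of even size, using plain Python lists. Not the HPMaX code.

def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def naive_mul(X, Y):
    """Standard row-by-column matrix product."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def split(M):
    """Split M into four quadrants: M11, M12, M21, M22."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen_tasks(A, B):
    """Return the seven multiplication tasks as (name, left, right).

    Forming the operands needs only sub-matrix additions/subtractions.
    """
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    return [
        ("M1", add(A11, A22), add(B11, B22)),
        ("M2", add(A21, A22), B11),
        ("M3", A11, sub(B12, B22)),
        ("M4", A22, sub(B21, B11)),
        ("M5", add(A11, A12), B22),
        ("M6", sub(A21, A11), add(B11, B12)),
        ("M7", sub(A12, A22), add(B21, B22)),
    ]

def strassen_one_level(A, B, multiply=naive_mul):
    """Multiply A and B with seven sub-matrix products instead of eight.

    The seven products are mutually independent, so a scheduler could
    hand each one to a different device; here they simply run in turn.
    """
    M = {name: multiply(L, R) for name, L, R in strassen_tasks(A, B)}
    C11 = add(sub(add(M["M1"], M["M4"]), M["M5"]), M["M7"])
    C12 = add(M["M3"], M["M5"])
    C21 = add(M["M2"], M["M4"])
    C22 = add(add(sub(M["M1"], M["M2"]), M["M3"]), M["M6"])
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Passing a different `multiply` callable is one simple way to imagine routing the seven products to another backend. Applying the same decomposition recursively gives the lower asymptotic complexity the abstract refers to, roughly O(n^2.81) versus O(n^3) for the standard algorithm.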


Bibliographic information

Source
Computing, 2020, No. 12, pp. 2607-2631 (25 pages)
Authors
Kang Homin; Kwon Hyuck Chan; Kim Duksu
Affiliations

Korea University of Technology and Education (KOREATECH), Cheonan, South Korea (all three authors)

Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
Original format: PDF
Language: English
Keywords
Matrix multiplication; Parallel algorithm; Heterogeneous; GPU; Strassen;


