Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

Ayaz ul Hassan Khan; Mayez Al-Mouhamed; Allam Fatayer; Nazeeruddin Mohammad

首页> 外文期刊>International journal of parallel programming >Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

【24h】

Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

机译：在多核上使用有限递归的Strassen和Winograd算法优化矩阵乘法

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

Many-core systems are basically designed for applications having large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on Strassen and Winograd algorithms (S-MM and W-MM) on many-core. A depth first (DFS) traversal of a recursion tree is used where all cores work in parallel on computing each of the N × N sub-matrices, which are computed in sequence. DFS reduces the storage to the detriment of large data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invoking efficient library (CUBLAS 5.5) for basic functions, and (3) using parameter-tuning of parametric kernel to improve resource occupancy. Evaluation of S-MM and W-MM is carried out on GPU and MIC (Xeon Phi). For GPU, W-MM and S-MM with one recursion level outperform CUBLAS 5.5 Library with up to twice as fast for arrays satisfying N ≥ 2048 and N ≥ 3072, respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to NVIDIA SDK library, S-MM and W-MM achieved a speedup between 20 × and 80 × for the above arrays. For MIC, two-recursion S-MM with reordering is faster than MKL library by 14-26% for N ≥ 1024. Proposed implementations achieve 2.35 TFLOPS (67% of peak) on GPU and 0.5 TFLOPS (21 % of peak) on MIC. Similar encouraging results are obtained for a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.

机译：多核系统基本上是为具有大数据并行性的应用程序而设计的。我们提出了一种基于Strassen和Winograd算法（S-MM和W-MM）的多核高效混合矩阵乘法实现。使用递归树的深度优先（DFS）遍历，其中所有核并行工作以计算按顺序计算的N×N个子矩阵。 DFS减少了存储量，从而不利于收集和聚合结果的大数据运动。所提出的方法使用了三个优化：（1）一小组基本代数函数以减少开销；（2）调用基本函数的有效库（CUBLAS 5.5）；（3）使用参数内核的参数调整来提高资源占用率。 S-MM和W-MM的评估是在GPU和MIC（至强融核）上进行的。对于GPU，具有一个递归级别的W-MM和S-MM的性能优于CUBLAS 5.5库，对于满足N≥2048和N≥3072的阵列，其速度要快两倍。使用重新排序（R-S-MM）的S-MM观察到类似趋势，该趋势用于节省存储空间。与NVIDIA SDK库相比，上述阵列的S-MM和W-MM实现了20×到80×的加速。对于MIC，在N≥1024时，具有重排序的两次递归S-MM比MKL库快14-26％。建议的实现在GPU上达到2.35 TFLOPS（峰值的67％），在MIC上达到0.5 TFLOPS（峰值的21％）。。 16核Xeon-E5服务器获得了类似的令人鼓舞的结果。我们得出结论，具有少量递归级别的S-MM和W-MM实现可用于进一步优化基本代数库的性能。

著录项

来源
《International journal of parallel programming》 |2016年第4期|801-830|共30页
作者
Ayaz ul Hassan Khan; Mayez Al-Mouhamed; Allam Fatayer; Nazeeruddin Mohammad;
展开▼
作者单位

Computer Science Department, Qassim University, Almulyda-Qassim, Saudi Arabia;

Computer Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia;

Computer Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia;

Computer Engineering Department, Prince Mohammad Bin Fahd University, Dhahran, Saudi Arabia;

展开▼
收录信息美国《科学引文索引》(SCI);美国《工程索引》(EI);
原文格式 PDF
正文语种 eng
中图分类
关键词
Graphics Processing Unit (GPU); CUDA programming; Strassen; Fast matrix multiplication;

机译：图形处理单元（GPU）;CUDA编程;Strassen;快速矩阵乘法;

相似文献

外文文献
中文文献
专利

1. Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms [J] . Tomonori Kouya JSIAM Letters . 2016,第2期

机译：使用并行Strassen和Winograd算法的多精度矩阵乘法性能评估
2. Impact of mixed-parallelism on parallel implementations of the Strassen and Winograd matrix multiplication algorithms [J] . F. Desprez, F. Suter Concurrency and Computation . 2004,第8期

机译：混合并行度对Strassen和Winograd矩阵乘法算法的并行实现的影响
3. Accelerated multiple precision matrix multiplication using Strassen's algorithm and Winograd's variant [J] . Tomonori Kouya JSIAM Letters . 2014,第3期

机译：使用Strassen算法和Winograd变体加速多精度矩阵乘法
4. Mixed parallel implementations of the top level step of Strassen and Winograd matrix multiplication algorithms [C] . Desprez, F., Suter, . 2001

机译：Strassen和Winograd矩阵乘法算法的顶层步骤的混合并行实现
5. Sparse Matrix Multiplication on a Many-core Platform [D] . Shi, Peiyao. 2018

机译：许多核心平台上的稀疏矩阵乘法
6. Many-core algorithms for statistical phylogenetics [O] . Marc A. Suchard, Andrew Rambaut -1

机译：统计系统发生学的多核算法
7. Erratum to: Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core [O] . Ayaz ul Hassan Khan, Mayez Al-Mouhamed, Allam Fatayer, 2015

机译：错误到：优化使用STRASSEN和WINOGRAD算法的MATRIX乘法，许多核心的递归有限

Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅