...
首页> 外文期刊>International journal of parallel programming >Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core
【24h】

Optimizing the Matrix Multiplication Using Strassen and Winograd Algorithms with Limited Recursions on Many-Core

机译:在多核上使用有限递归的Strassen和Winograd算法优化矩阵乘法

获取原文
获取原文并翻译 | 示例
           

摘要

Many-core systems are basically designed for applications having large data parallelism. We propose an efficient hybrid matrix multiplication implementation based on Strassen and Winograd algorithms (S-MM and W-MM) on many-core. A depth first (DFS) traversal of a recursion tree is used where all cores work in parallel on computing each of the N × N sub-matrices, which are computed in sequence. DFS reduces the storage to the detriment of large data motion to gather and aggregate the results. The proposed approach uses three optimizations: (1) a small set of basic algebra functions to reduce overhead, (2) invoking efficient library (CUBLAS 5.5) for basic functions, and (3) using parameter-tuning of parametric kernel to improve resource occupancy. Evaluation of S-MM and W-MM is carried out on GPU and MIC (Xeon Phi). For GPU, W-MM and S-MM with one recursion level outperform CUBLAS 5.5 Library with up to twice as fast for arrays satisfying N ≥ 2048 and N ≥ 3072, respectively. Similar trends are observed for S-MM with reordering (R-S-MM), which is used to save storage. Compared to NVIDIA SDK library, S-MM and W-MM achieved a speedup between 20 × and 80 × for the above arrays. For MIC, two-recursion S-MM with reordering is faster than MKL library by 14-26% for N ≥ 1024. Proposed implementations achieve 2.35 TFLOPS (67% of peak) on GPU and 0.5 TFLOPS (21 % of peak) on MIC. Similar encouraging results are obtained for a 16-core Xeon-E5 server. We conclude that S-MM and W-MM implementations with a few recursion levels can be used to further optimize the performance of basic algebra libraries.
机译:多核系统基本上是为具有大数据并行性的应用程序而设计的。我们提出了一种基于Strassen和Winograd算法(S-MM和W-MM)的多核高效混合矩阵乘法实现。使用递归树的深度优先(DFS)遍历,其中所有核并行工作以计算按顺序计算的N×N个子矩阵。 DFS减少了存储量,从而不利于收集和聚合结果的大数据运动。所提出的方法使用了三个优化:(1)一小组基本代数函数以减少开销;(2)调用基本函数的有效库(CUBLAS 5.5);(3)使用参数内核的参数调整来提高资源占用率。 S-MM和W-MM的评估是在GPU和MIC(至强融核)上进行的。对于GPU,具有一个递归级别的W-MM和S-MM的性能优于CUBLAS 5.5库,对于满足N≥2048和N≥3072的阵列,其速度要快两倍。使用重新排序(R-S-MM)的S-MM观察到类似趋势,该趋势用于节省存储空间。与NVIDIA SDK库相比,上述阵列的S-MM和W-MM实现了20×到80×的加速。对于MIC,在N≥1024时,具有重排序的两次递归S-MM比MKL库快14-26%。建议的实现在GPU上达到2.35 TFLOPS(峰值的67%),在MIC上达到0.5 TFLOPS(峰值的21%)。 。 16核Xeon-E5服务器获得了类似的令人鼓舞的结果。我们得出结论,具有少量递归级别的S-MM和W-MM实现可用于进一步优化基本代数库的性能。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号