Source: Journal of Supercomputing

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster

Abstract

In this study, a communication-avoiding generalized minimum residual method (CA-GMRES) is implemented on a hybrid CPU-GPU cluster to accelerate the iterative linear system solver in the gyrokinetic toroidal five-dimensional Eulerian code (GT5D). In GT5D, the sparse matrix-vector multiplication (SpMV) is performed as a 17-point stencil-based computation. Only the SpMV is specialized for GT5D; the other parts can also be used in other application codes. In addition to CA-GMRES, we implement and evaluate a modified variant of CA-GMRES (M-CA-GMRES), proposed in a previous study by Idomura et al. (in: Proceedings of the 8th workshop on latest advances in scalable algorithms for large-scale systems (ScalA '17), 2017. https://doi.org/10.1145/3148226.3148234), which reduces the amount of floating-point calculations. This study demonstrates that the beneficial features of CA-GMRES are its minimal number of collective communications and its highly efficient calculations based on dense matrix-matrix operations. The performance evaluation is conducted on the Reedbush-L GPU cluster, which contains four NVIDIA Tesla P100 (Pascal GP100) GPUs per compute node. The evaluation results show that M-CA-GMRES or CA-GMRES for GT5D is advantageous over GMRES or the generalized conjugate residual method (GCR) on GPU clusters, especially when the problem size (vector length) is large so that the cost of the SpMV is less dominant. When 64 GPUs are used, M-CA-GMRES is 1.09x, 1.22x and 1.50x faster than CA-GMRES, GCR and GMRES, respectively.
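The communication argument in the abstract can be illustrated with a small, purely local sketch. It is not the paper's implementation. It assumes a 1D 3-point stencil as a stand-in for GT5D's 17-point stencil, and it assumes CholeskyQR for the block orthogonalization, which the abstract does not specify. The function names and the `reductions` counter, which stands in for collective communications such as an MPI all-reduce, are hypothetical.

```python
import math

def stencil_spmv(x):
    """Matrix-free y = A x for A = tridiag(-1, 4, -1); a toy 3-point stencil (assumption)."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        y[i] = 4.0 * x[i] - left - right
    return y

def dot(a, b):
    """Local dot product; on a cluster each call would need one global reduction."""
    return sum(p * q for p, q in zip(a, b))

def arnoldi_mgs(v0, s):
    """GMRES-style basis via modified Gram-Schmidt: one reduction per dot product or norm."""
    reductions = 1
    nrm = math.sqrt(dot(v0, v0))
    basis = [[c / nrm for c in v0]]
    for j in range(s):
        w = stencil_spmv(basis[j])
        for q in basis:
            h = dot(q, w)
            reductions += 1
            w = [wi - h * qi for wi, qi in zip(w, q)]
        nrm = math.sqrt(dot(w, w))
        reductions += 1
        basis.append([c / nrm for c in w])
    return basis, reductions

def ca_basis_choleskyqr(v0, s):
    """s-step basis: s stencil applications, then one batched reduction (CholeskyQR, assumed)."""
    V = [list(v0)]
    for _ in range(s):                      # monomial basis v, Av, ..., A^s v; no reductions
        V.append(stencil_spmv(V[-1]))
    m = s + 1
    # Gram matrix G = V^T V: all entries can be summed in a single collective.
    G = [[dot(V[i], V[j]) for j in range(m)] for i in range(m)]
    reductions = 1
    # Small dense Cholesky G = R^T R with R upper triangular (done redundantly, no communication).
    R = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            acc = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = math.sqrt(acc) if i == j else acc / R[i][i]
    # Q = V R^{-1}: a dense matrix-matrix style update, local to each process.
    Q = []
    for j in range(m):
        w = list(V[j])
        for k in range(j):
            w = [wi - R[k][j] * qi for wi, qi in zip(w, Q[k])]
        Q.append([wi / R[j][j] for wi in w])
    return Q, reductions

def orth_error(Q):
    """Largest deviation of Q^T Q from the identity."""
    return max(abs(dot(Q[i], Q[j]) - (1.0 if i == j else 0.0))
               for i in range(len(Q)) for j in range(len(Q)))
```

For a step size of s = 3, `arnoldi_mgs` counts 10 reductions while `ca_basis_choleskyqr` counts 1, and both return an orthonormal basis of the same Krylov subspace. The sketch omits recovering the Hessenberg matrix that a full CA-GMRES needs. It also uses the monomial basis, which becomes ill-conditioned as s grows.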