Journal: Parallel Computing

Variable-size batched Gauss-Jordan elimination for block-Jacobi preconditioning on graphics processors

Abstract

In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small, independent linear systems. To this end, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE), along with a variable-size batched matrix-vector multiplication kernel that transforms the right-hand sides of the linear systems into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication available on newer GPU architectures. Moreover, in the matrix inversion we employ an implicit pivoting strategy that migrates the workload (i.e., the operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. Experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the cost of setting up and applying the preconditioner can be somewhat offset by the faster convergence of the iterative solver. © 2018 Elsevier B.V. All rights reserved.
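
The ideas in the abstract can be illustrated with a minimal sequential sketch in plain Python. It is not the paper's GPU kernel: there are no registers, warps, or batched launches here, and all function names (`gje_invert_implicit_pivoting`, `setup_block_jacobi`, `apply_block_jacobi`) are hypothetical. The sketch assumes a dense list-of-lists matrix and a caller-supplied list of diagonal block sizes. It shows Gauss-Jordan inversion in which rows are never swapped (the pivot choice is only recorded and the permutation is resolved when the result is read out), followed by one small matrix-vector product per block to apply the preconditioner.

```python
def gje_invert_implicit_pivoting(block):
    """Invert a small dense matrix by Gauss-Jordan elimination without row swaps."""
    n = len(block)
    # Augmented rows [A | I]. Each row stays in place, loosely mimicking a row
    # that is held by one thread (an assumption made for illustration only).
    rows = [[float(v) for v in block[i]] + [1.0 if i == j else 0.0 for j in range(n)]
            for i in range(n)]
    pivot_row = [-1] * n   # pivot_row[k] = row that served as pivot for column k
    used = [False] * n
    for k in range(n):
        # Choose the not-yet-used row with the largest magnitude in column k.
        p = max((i for i in range(n) if not used[i]), key=lambda i: abs(rows[i][k]))
        if rows[p][k] == 0.0:
            raise ValueError("singular block")
        used[p] = True
        pivot_row[k] = p
        d = rows[p][k]
        rows[p] = [v / d for v in rows[p]]
        # Eliminate column k from every other row; no data is moved between rows.
        for i in range(n):
            if i != p and rows[i][k] != 0.0:
                f = rows[i][k]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[p])]
    # The right half of row pivot_row[k] is row k of the inverse.
    return [rows[pivot_row[k]][n:] for k in range(n)]


def setup_block_jacobi(matrix, block_sizes):
    """Extract the diagonal blocks of the given (variable) sizes and invert each."""
    inverses, start = [], 0
    for size in block_sizes:
        block = [row[start:start + size] for row in matrix[start:start + size]]
        inverses.append(gje_invert_implicit_pivoting(block))
        start += size
    return inverses


def apply_block_jacobi(inverses, r):
    """Compute z = M^{-1} r with one small matrix-vector product per block."""
    z, start = [], 0
    for inv in inverses:
        size = len(inv)
        segment = r[start:start + size]
        z.extend(sum(a * x for a, x in zip(row, segment)) for row in inv)
        start += size
    return z


# Usage: a 3x3 matrix with diagonal blocks of sizes 2 and 1. The first block
# has a zero in its (0, 0) entry, so pivoting is required to invert it.
A = [[0, 2, 1],
     [1, 1, 0],
     [0, 0, 4]]
inverses = setup_block_jacobi(A, [2, 1])
z = apply_block_jacobi(inverses, [2, 3, 8])
print(z)
```

Because the pivot order is recorded instead of applied, the elimination loop never exchanges rows. That is the property the abstract describes as moving the work to where the data resides, although how the paper maps it onto GPU threads is not shown here.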