...
ACM Transactions on Parallel Computing

Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors

Abstract

This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. A detailed performance analysis highlights the capabilities of the GPU architecture and of the chosen execution model, which uses threading within each element, particularly for the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way to avoid the race conditions that can arise when summing the elemental residuals into the global vector entries associated with shared vertices, edges, and surfaces. In addition, the solver infrastructure supports mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.
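
The abstract names matrix-free operator evaluation with sum factorization as the central algorithmic component. The following is a minimal Python sketch of the idea in 2D, not the authors' GPU implementation. It assumes a tensor-product element whose 1D shape functions have been tabulated at the 1D quadrature points in a matrix `S` (a hypothetical name, with `S[q][i]` the value of 1D basis function `i` at point `q`) and coefficients `u[i][j]`. Interpolating to all quadrature points naively costs about `nq^2 * n^2` multiply-adds; applying `S` one coordinate direction at a time gives the same result for about `nq * n^2 + nq^2 * n`.

```python
def interpolate_naive(S, u):
    """Evaluate u at every 2D quadrature point by summing over all n*n basis functions."""
    nq, n = len(S), len(u)
    return [[sum(S[q1][i] * S[q2][j] * u[i][j]
                 for i in range(n) for j in range(n))
             for q2 in range(nq)]
            for q1 in range(nq)]


def interpolate_sum_factorized(S, u):
    """Same result, applying the 1D matrix S along one coordinate direction at a time."""
    nq, n = len(S), len(u)
    # Stage 1 (first direction): t[q1][j] = sum_i S[q1][i] * u[i][j]
    t = [[sum(S[q1][i] * u[i][j] for i in range(n)) for j in range(n)]
         for q1 in range(nq)]
    # Stage 2 (second direction): v[q1][q2] = sum_j S[q2][j] * t[q1][j]
    return [[sum(S[q2][j] * t[q1][j] for j in range(n)) for q2 in range(nq)]
            for q1 in range(nq)]


if __name__ == "__main__":
    # Hypothetical 1D tabulation: 3 basis functions (degree 2) at 4 quadrature points.
    S = [[1.0 / (q + i + 1) for i in range(3)] for q in range(4)]
    u = [[i - 2.0 * j + 0.5 for j in range(3)] for i in range(3)]
    a = interpolate_naive(S, u)
    b = interpolate_sum_factorized(S, u)
    print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

In 3D a third stage is added; with `n = nq = k + 1` for polynomial degree `k`, the per-element cost drops from roughly `(k+1)^6` to `3 * (k+1)^4` operations, and no global matrix has to be stored.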

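The abstract's last sentence describes mixed-precision arithmetic: the multigrid V-cycle runs in single precision while an outer correction is computed in double precision. Below is a minimal sketch of that pattern under clearly stated assumptions, not the paper's solver: a 1D Poisson problem -u'' = f on (0, 1) with zero Dirichlet boundaries and second-order finite differences (instead of high-order finite elements), a hand-written V-cycle with weighted Jacobi smoothing, full-weighting restriction and linear interpolation, and float32 emulated by rounding through Python's `struct` module. All function names are hypothetical. The residual and the solution update stay in double precision, so the iteration can reach double-precision accuracy although every V-cycle works in emulated float32.

```python
import math
import struct


def to_f32(values):
    """Round each entry to the nearest IEEE-754 single-precision value (emulated float32)."""
    return [struct.unpack("f", struct.pack("f", v))[0] for v in values]


def apply_laplacian(u, h):
    """Apply the 1D finite-difference operator -u'' with zero Dirichlet boundaries."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out


def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi smoothing; the operator's diagonal is 2 / h^2. Results rounded to float32."""
    for _ in range(sweeps):
        r = [fi - ai for fi, ai in zip(f, apply_laplacian(u, h))]
        u = to_f32([ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)])
    return u


def v_cycle(f, h, sweeps=2):
    """One geometric multigrid V-cycle for A e = f from e = 0; needs len(f) == 2**k - 1."""
    n = len(f)
    if n == 1:
        return to_f32([f[0] * h * h / 2.0])    # exact solve on the coarsest grid
    e = jacobi([0.0] * n, f, h, sweeps)        # pre-smoothing
    r = [fi - ai for fi, ai in zip(f, apply_laplacian(e, h))]
    # Full-weighting restriction: coarse point j sits at fine index 2j + 1.
    rc = to_f32([0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
                 for j in range((n - 1) // 2)])
    ec = v_cycle(rc, 2.0 * h, sweeps)
    nc = len(ec)
    # Linear interpolation of the coarse correction back to the fine grid.
    for i in range(n):
        if i % 2 == 1:
            corr = ec[i // 2]
        else:
            left = ec[i // 2 - 1] if i // 2 >= 1 else 0.0
            right = ec[i // 2] if i // 2 < nc else 0.0
            corr = 0.5 * (left + right)
        e[i] += corr
    return jacobi(to_f32(e), f, h, sweeps)     # post-smoothing


def mixed_precision_solve(b, h, tol=1e-10, max_iter=50):
    """Defect correction: residual and update in double, V-cycle in emulated float32."""
    x = [0.0] * len(b)
    norm_b = math.sqrt(sum(v * v for v in b))
    for it in range(1, max_iter + 1):
        r = [bi - ai for bi, ai in zip(b, apply_laplacian(x, h))]   # double precision
        if math.sqrt(sum(v * v for v in r)) <= tol * norm_b:
            return x, it - 1                                          # corrections applied
        e = v_cycle(to_f32(r), h)                                     # single precision
        x = [xi + ei for xi, ei in zip(x, e)]                         # double-precision update
    return x, max_iter


if __name__ == "__main__":
    n = 63                      # 2**6 - 1 interior points
    h = 1.0 / (n + 1)
    x, iters = mixed_precision_solve([1.0] * n, h)
    print(iters, max(abs(x[i] - (i + 1) * h * (1 - (i + 1) * h) / 2) for i in range(n)))
```

Because this finite-difference scheme is exact for quadratics, solving with f = 1 should reproduce the nodal values of x(1 - x)/2 up to the solver tolerance. The design choice is that only the V-cycle, where nearly all the work happens, is demoted to single precision (in a compiled implementation this halves the bytes per vector entry); the double-precision outer residual corrects the rounding errors the V-cycle introduces.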