Journal: ACM Transactions on Parallel Computing

Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors

Abstract

This article presents matrix-free finite-element techniques for efficiently solving partial differential equations on modern many-core processors, such as graphics cards. We develop a GPU parallelization of a matrix-free geometric multigrid iterative solver targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes. The central algorithmic component is the matrix-free operator evaluation with sum factorization. We compare the node-level performance of our implementation running on an Nvidia Pascal P100 GPU to a highly optimized multicore implementation running on comparable Intel Broadwell CPUs and an Intel Xeon Phi. Our experiments show that the GPU implementation is approximately 1.5 to 2 times faster across four different scenarios of the Poisson equation and a variety of element degrees in 2D and 3D. The lowest time to solution per degree of freedom is recorded for moderate polynomial degrees between 3 and 5. A detailed performance analysis highlights the capabilities of the GPU architecture and the chosen execution model with threading within the element, particularly with respect to the evaluation of the matrix-vector product. Atomic intrinsics are shown to provide a fast way to avoid the possible race conditions in summing the elemental residuals into the global vector associated with shared vertices, edges, and surfaces. In addition, the solver infrastructure allows for using mixed-precision arithmetic that performs the multigrid V-cycle in single precision with an outer correction in double precision, increasing throughput by up to 83%.
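
The abstract names sum factorization as the central algorithmic component without spelling it out. The following is a minimal sketch in plain Python, not the article's GPU implementation. It assumes a single 2D element whose 1D interpolation matrix `S` is the same in both directions; `S`, `U`, and all function names are hypothetical and chosen only for illustration. The idea is that applying the tensor-product operator S ⊗ S to the element coefficients is equivalent to two small 1D sweeps, S U Sᵀ, so the full Kronecker-product matrix never has to be formed.

```python
def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def apply_sum_factorized(S, U):
    """Evaluate (S ⊗ S) applied to U as S U S^T, one 1D sweep per direction."""
    return matmul(matmul(S, U), transpose(S))


def apply_naive(S, U):
    """Reference: apply the full tensor-product operator entry by entry."""
    nq, n = len(S), len(S[0])
    out = [[0.0] * nq for _ in range(nq)]
    for q1 in range(nq):
        for q2 in range(nq):
            out[q1][q2] = sum(S[q1][i] * S[q2][j] * U[i][j]
                              for i in range(n) for j in range(n))
    return out


# Hypothetical data: a 2x2 interpolation matrix and 2x2 element coefficients.
S = [[0.75, 0.25],
     [0.25, 0.75]]
U = [[1.0, 2.0],
     [3.0, 4.0]]

fast = apply_sum_factorized(S, U)
slow = apply_naive(S, U)
```

Both functions return the same values. With n points per direction, the naive evaluation costs on the order of n⁴ operations in 2D, while the factorized form costs on the order of n³. In d dimensions the corresponding counts are n²ᵈ versus d·nᵈ⁺¹, which is why the technique pays off at the moderate and high polynomial degrees the abstract targets.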
