Society of Petroleum Engineers Reservoir Simulation Symposium

Multi-GPU Parallelization of Nested Factorization for Solving Large Linear Systems

Abstract

We describe a massively parallel Nested Factorization (NF) linear solver for large systems of equations. NF is a powerful classic preconditioner that is receiving renewed attention because of its potential on emerging parallel architectures, especially Graphics Processing Units (GPUs). We build on the Massively Parallel NF (MPNF) framework described by Appleyard et al. (2011). MPNF divides the three-dimensional grid into ‘kernels’ and assigns each kernel a color such that no two neighboring kernels share the same color. Parallelism is exploited by operating on all kernels of a given color simultaneously and cycling through the NF operations color by color. Our MPNF algorithm is designed with special attention to asynchronous CPU-to-GPU memory transfer during the setup phase. In addition, we use a CUDA-based BiCGStab Krylov solver and a customized ‘reduction kernel’ with higher bandwidth. The key features of the algorithm are: (1) a special ordering of the matrix elements that maximizes coalesced access to GPU global memory and speeds up kernel execution several-fold; (2) twisted factorization, which increases the number of concurrent threads at no additional cost; and (3) extension to multiple GPUs by first solving the so-called halo region on each GPU and then overlapping peer-to-peer memory transfer between GPUs with the solution of the interior regions. We demonstrate the GPU-based NF solver on several large problems and break down the performance of each algorithmic component. For the SPE10 model (highly heterogeneous, with over one million cells) on a 512-core Tesla M2090 GPU, our implementation achieves speedups of 26× in single precision and 19× in double precision relative to a single core of a Xeon X5660 CPU. Moreover, the 6-GPU (3072-core) solution of a highly refined SPE10 model (26.9 million cells) is more than five times faster than the single-GPU solution.
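Feature (2) is the easiest to illustrate in isolation. In standard NF the innermost level involves tridiagonal solves along grid lines; we assume here that this is where the twisted factorization applies. A twisted (two-sided) factorization eliminates from both ends of a line toward a meeting row, so the top-down and bottom-up sweeps are independent and could be given to two threads instead of one. The Python sketch below is a minimal, sequential illustration under those assumptions, not the paper's CUDA implementation: the array names `a`, `b`, `c` (sub-, main, and super-diagonal), the twist index `m`, and its default `n // 2` are all hypothetical choices.

```python
from typing import List, Optional


def twisted_tridiagonal_solve(
    a: List[float],
    b: List[float],
    c: List[float],
    d: List[float],
    m: Optional[int] = None,
) -> List[float]:
    """Solve T x = d for a tridiagonal T with a twisted (two-sided) factorization.

    Row i of T is: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]
    (a[0] and c[n-1] are ignored). m is the twist row where the sweeps meet.
    """
    n = len(b)
    if m is None:
        m = n // 2  # assumption: meet in the middle to balance the two sweeps
    if not 0 <= m < n:
        raise ValueError("twist index m must satisfy 0 <= m < n")

    # Top-down sweep over rows 0..m-1 (independent of the bottom-up sweep).
    # Afterwards row i reads: x[i] + cp[i]*x[i+1] = dp[i].
    cp = [0.0] * n
    dp = [0.0] * n
    for i in range(m):
        prev_c = cp[i - 1] if i > 0 else 0.0
        prev_d = dp[i - 1] if i > 0 else 0.0
        denom = b[i] - a[i] * prev_c
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * prev_d) / denom

    # Bottom-up sweep over rows n-1..m+1.
    # Afterwards row i reads: ap[i]*x[i-1] + x[i] = dq[i].
    ap = [0.0] * n
    dq = [0.0] * n
    for i in range(n - 1, m, -1):
        next_a = ap[i + 1] if i < n - 1 else 0.0
        next_d = dq[i + 1] if i < n - 1 else 0.0
        denom = b[i] - c[i] * next_a
        ap[i] = a[i] / denom
        dq[i] = (d[i] - c[i] * next_d) / denom

    # Twist row m: substitute x[m-1] and x[m+1] from both sweeps, solve for x[m].
    denom = b[m]
    rhs = d[m]
    if m > 0:
        denom -= a[m] * cp[m - 1]
        rhs -= a[m] * dp[m - 1]
    if m < n - 1:
        denom -= c[m] * ap[m + 1]
        rhs -= c[m] * dq[m + 1]
    x = [0.0] * n
    x[m] = rhs / denom

    # Back-substitute outward from the twist (again two independent loops).
    for i in range(m - 1, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    for i in range(m + 1, n):
        x[i] = dq[i] - ap[i] * x[i - 1]
    return x


def tridiagonal_matvec(a, b, c, x):
    """Compute T x for the same (a, b, c) convention, used to check the solve."""
    n = len(b)
    return [
        (a[i] * x[i - 1] if i > 0 else 0.0)
        + b[i] * x[i]
        + (c[i] * x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


if __name__ == "__main__":
    a = [0.0, -1.0, -1.0, -1.0, -1.0]
    b = [4.0] * 5
    c = [-1.0, -1.0, -1.0, -1.0, 0.0]
    x_true = [1.0, 2.0, 3.0, 4.0, 5.0]
    d = tridiagonal_matvec(a, b, c, x_true)  # [2.0, 4.0, 6.0, 8.0, 16.0]
    for m in range(len(b)):
        x = twisted_tridiagonal_solve(a, b, c, d, m)
        print(m, [round(v, 12) for v in x])
```

Every choice of `m` yields the same solution; only the split of work between the two sweeps changes. Because the top-down loop touches rows `0..m-1` and the bottom-up loop rows `m+1..n-1`, with roughly the same operation count as the ordinary Thomas algorithm, this is one plausible reading of "more concurrent threads at no additional cost".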