We describe a massively parallel Nested Factorization (NF) linear solver for large systems of equations. NF is a powerful classic preconditioner that is receiving renewed attention because of its potential on emerging parallel architectures, especially Graphics Processing Units (GPUs). We build on the Massively Parallel NF (MPNF) framework described by Appleyard et al. (2011). MPNF divides the three-dimensional grid into ‘kernels’ and assigns each kernel a color such that no two neighboring kernels share the same color. Parallelism is exploited by operating on all the kernels of a given color simultaneously and cycling through the NF operations color by color. Our MPNF algorithm is designed with special attention to asynchronous CPU-to-GPU memory transfer during the setup phase. In addition, we use a CUDA-based BiCGStab Krylov solver and a customized ‘reduction kernel’ with greater bandwidth. The key features of the algorithm are: (1) a special ordering of the matrix elements that maximizes coalesced access to GPU global memory and speeds up kernel execution severalfold, (2) application of twisted factorization, which increases the number of concurrent threads at no additional cost, and (3) extension to multiple GPUs by first solving the so-called halo region on each GPU and then overlapping peer-to-peer memory transfer between GPUs with the solution of the interior regions. The GPU-based NF solver is demonstrated on several large problems, and we break down the performance of each algorithmic component. For the SPE10 model (highly heterogeneous, with over one million cells) on a 512-core Tesla M2090 GPU, our implementation achieves a speedup of 26 for single-precision and 19 for double-precision computations relative to a single core of the Xeon X5660 CPU. Moreover, the 6-GPU (3072-core) solution of a highly refined SPE10 model (26.9 million cells) is more than five times faster than the single-GPU solution.
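The coloring idea can be illustrated with a minimal Python sketch. It is not the MPNF implementation: it assumes, purely for illustration, that the kernels form a 2D `ny` by `nz` array and that a checkerboard rule is used. The function names `color_kernels`, `is_valid_coloring`, and `color_sweep_order` are hypothetical, and the actual kernel shape and number of colors are not stated in the text above.

```python
from itertools import product


def color_kernels(ny, nz, n_colors=2):
    """Assign a color to every kernel (j, k) of an ny-by-nz kernel array.

    Assumption: a checkerboard-style rule. Face neighbors differ by exactly
    one in (j + k), so their colors differ for any n_colors >= 2.
    """
    return {(j, k): (j + k) % n_colors
            for j, k in product(range(ny), range(nz))}


def neighbors(j, k, ny, nz):
    """Yield the face neighbors of kernel (j, k) that lie inside the array."""
    for dj, dk in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        jj, kk = j + dj, k + dk
        if 0 <= jj < ny and 0 <= kk < nz:
            yield jj, kk


def is_valid_coloring(colors, ny, nz):
    """Check that no two neighboring kernels share a color."""
    return all(colors[(j, k)] != colors[n]
               for (j, k) in colors
               for n in neighbors(j, k, ny, nz))


def color_sweep_order(colors):
    """Group kernels by color.

    Kernels inside one group have no mutual neighbors, so a parallel
    implementation could process a whole group concurrently and then
    move on to the next color.
    """
    groups = {}
    for kernel, color in colors.items():
        groups.setdefault(color, []).append(kernel)
    return [groups[c] for c in sorted(groups)]


if __name__ == "__main__":
    colors = color_kernels(4, 3)
    for color, group in enumerate(color_sweep_order(colors)):
        print(f"color {color}: {len(group)} kernels could run concurrently")
```

In this sketch each color group corresponds to one parallel step; on a GPU, one would expect each group to map to a single kernel launch, though that mapping is an assumption rather than something stated above.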