Optimizing the hypre solver for manycore and GPU architectures

Sahasrabudhe Damodar; Zambre Rohit; Chandramowlishwaran Aparna; Berzins Martin

Abstract

The solution of large-scale combustion problems with codes such as Uintah on modern computer architectures requires the use of multithreading and GPUs to achieve performance. Uintah uses a low-Mach number approximation that requires iteratively solving a large system of linear equations. The Hypre iterative solver has solved such systems in a scalable way for Uintah, but using OpenMP with Hypre leads to a slowdown of at least 2x due to OpenMP overheads. The proposed solution uses MPI Endpoints within Hypre, where each team of threads acts as a different MPI rank. This approach minimizes OpenMP synchronization overhead, performs as fast as, or up to 1.44x faster than, Hypre's MPI-only version, and allows the rest of Uintah to be optimized using OpenMP. Profiling of the GPU version of Hypre shows the bottleneck to be the launch overhead of thousands of micro-kernels. The GPU performance was improved by fusing these micro-kernels and was further optimized by using CUDA-aware MPI, resulting in an overall speedup of 1.16-1.44x compared to the baseline GPU implementation. The above optimization strategies were published in the International Conference on Computational Science 2020 [1]. This work extends the previously published research by carrying out a second phase of communication-centered optimizations in Hypre to improve its scalability on large-scale supercomputers. This phase includes an efficient non-blocking inter-thread communication scheme, communication-reducing patch assignment, and expression of logical communication parallelism to a new version of the MPICH library that utilizes the underlying network parallelism [2]. These optimizations avoid the communication bottlenecks previously observed during strong scaling and improve performance by up to 2x on 256 nodes of Intel Knights Landing processors.
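The idea that "each team of threads acts as a different MPI rank" can be illustrated with a small index mapping. The following is a minimal sketch, not the authors' implementation: the function name `endpoint_rank`, its parameters, and the contiguous team layout are all assumptions made for illustration, and no real MPI or Hypre API is used.

```python
def endpoint_rank(mpi_rank: int, thread_id: int,
                  threads_per_team: int, teams_per_rank: int) -> int:
    """Map an (MPI rank, thread) pair to a hypothetical endpoint rank.

    Assumption: threads are grouped into contiguous teams, and each team
    is presented to the solver as its own rank.
    """
    total_threads = threads_per_team * teams_per_rank
    if not 0 <= thread_id < total_threads:
        raise ValueError("thread_id out of range for this rank")
    # Threads 0..threads_per_team-1 form team 0, the next block team 1, etc.
    team_id = thread_id // threads_per_team
    # Endpoint ranks are numbered rank by rank, team by team.
    return mpi_rank * teams_per_rank + team_id


if __name__ == "__main__":
    # 2 MPI ranks, each with 4 teams of 4 threads -> 8 endpoint ranks.
    for rank in range(2):
        ids = [endpoint_rank(rank, t, 4, 4) for t in range(16)]
        print(rank, ids)
```

Under this assumed layout, threads within a team share one endpoint rank, so synchronization is confined to the team rather than spanning all threads of the process.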

