Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and its Application to Unstructured Matrices



Abstract

A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication patterns in HPCG represent building blocks commonly found in today's HPC applications. While it is a well-known challenge to efficiently parallelize the Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of the Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on the Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations benefit not only the original HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
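
The multi-color idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain point-wise greedy coloring rather than block multi-color reordering, a dense NumPy matrix for brevity, and assumes a matrix with a symmetric sparsity pattern and nonzero diagonal. The function names `greedy_coloring` and `multicolor_gauss_seidel_sweep` are hypothetical. Rows that share a color have no mutual coupling, so each color can be updated in one parallel (here, vectorized) step.

```python
import numpy as np

def greedy_coloring(A):
    """Give each row the smallest color not used by an already-colored neighbor.

    Assumes A has a symmetric sparsity pattern (illustrative assumption).
    """
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)
    for i in range(n):
        neighbors = np.nonzero(A[i])[0]
        used = {int(colors[j]) for j in neighbors if j != i and colors[j] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[i] = c
    return colors

def multicolor_gauss_seidel_sweep(A, b, x, colors):
    """One forward Gauss-Seidel sweep, processing colors one after another.

    Within a color, rows do not depend on each other, so the whole group
    is updated at once using the latest values of all other colors.
    """
    diag = np.diag(A)
    for c in range(colors.max() + 1):
        rows = np.nonzero(colors == c)[0]
        residual = b[rows] - A[rows] @ x   # uses the most recent x
        x[rows] += residual / diag[rows]   # x_i <- x_i + r_i / a_ii
    return x

# Small demonstration on a 1D Laplacian (tridiagonal) system.
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = np.zeros(n)
colors = greedy_coloring(A)
for _ in range(300):
    multicolor_gauss_seidel_sweep(A, b, x, colors)
```

On this tridiagonal example the greedy pass reduces to the classic red-black ordering. Using fewer, larger colors exposes more parallelism, but reordering generally changes the convergence behavior of the smoother, which is the trade-off the abstract refers to.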


Bibliographic details

Source
International Conference for High Performance Computing, Networking, Storage and Analysis | 2014 | pp. 945-955 | 11 pages
Authors
Jongsoo Park; Mikhail Smelyanskiy; Karthikeyan Vaidyanathan; Alexander Heinecke; Dhiraj D. Kalamkar; Xing Liu; M. Mostofa Ali Patwary; Yutong Lu; Pradeep Dubey
Original format: PDF
Keywords
conjugate gradient methods; iterative methods; matrix algebra; message passing; optimisation; parallel processing; shared memory systems; 3D grid; CG convergence rate; Gauss-Seidel smoother parallelization; HPC applications; HPCG; MPI parallelization; TFLOPS; Tianhe-2 system; Xeon Phi shared-memory implementation; algorithmic optimizations; architecture-aware optimizations; block multicolor reordering; communication overhead; communication pattern; data access locality; high performance conjugate gradient benchmark; next generation extreme-scale computing systems; parallelism; sparse linear solvers; unstructured matrices; Benchmark testing; Convergence; Equations; Parallel processing; Sparse matrices; Synchronization; Vectors;


Similar literature

1. Parallelization Strategies for Element-by-Element Preconditioned Conjugate Gradient Solver Using High-Performance Fortran for Unstructured Finite-Element Applications on Linux Clusters [J]. Ganesh Thiagarajan, Vibhas Aravamuthan. Journal of Computing in Civil Engineering, 2002, No. 1.
2. High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems [J]. Dongarra Jack, Heroux Michael A., Luszczek Piotr. Experimental Mechanics, 2016, No. 1.
3. Optimizations in a high-performance conjugate gradient benchmark for IA-based multi- and many-core processors [J]. Park Jongsoo, Smelyanskiy Mikhail, Vaidyanathan Karthikeyan. Experimental Mechanics, 2016, No. 1.
4. A CUDA Implementation of the High Performance Conjugate Gradient Benchmark [C]. Everett Phillips, Massimiliano Fatica. International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computing Systems; ACM/IEEE International Conference for High-Performance Computing, Networking, Storage and Analysis, 2015.
5. Conjugate reducibility of families of block-diagonal matrices over an extension field of a perfect field, and applications to matrix subalgebras and subgroups [D]. Brock, Martin L. 2004.
6. Formulation of a Model Resin System for Benchmarking Processing-Property Relationships in High-Performance Photo 3D Printing Applications [O]. Jianwei Tu, Kamran Makarian, Nicolas J. Alvarez. 2020.
7. A High Throughput FPGA-Based Floating Point Conjugate Gradient Implementation for Dense Matrices [O]. Roldao A, Constantinides GA. 2010.
8. Toward Efficient Implementations of PCCG (Preconditioned Conjugate Gradient) Methods on Vector Supercomputers [R]. Melhem, R. 1986.
