IEEE International Parallel and Distributed Processing Symposium

Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures



Abstract

Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth-reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU is a viable option. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multi-core CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.

Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to that of the ground-truth single-threaded algorithm.

We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA's single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches.
Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.
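For reference, the serial baseline that the paper parallelizes is the classic Cuthill-McKee traversal: a breadth-first visit that orders each node's undiscovered neighbors by ascending degree, with the final ordering reversed. The sketch below is a minimal illustration of that ground-truth serial algorithm (not the paper's speculative batched scheme); it assumes a connected undirected graph given as an adjacency-list dict, and uses a simple minimum-degree start heuristic in place of more elaborate pseudo-peripheral starting-node searches.

```python
from collections import deque

def rcm_permutation(adj, start=None):
    """Serial Reverse Cuthill-McKee on a connected undirected graph
    given as an adjacency list {node: [neighbors]}.
    Returns the nodes in RCM order (illustrative sketch)."""
    # Heuristic start: a minimum-degree node. (Production implementations
    # typically search for a pseudo-peripheral node instead.)
    if start is None:
        start = min(adj, key=lambda v: len(adj[v]))
    visited = {start}
    order = [start]
    queue = deque([start])
    while queue:
        v = queue.popleft()
        # Discover unvisited children and sort them by ascending degree,
        # as the Cuthill-McKee ordering prescribes.
        children = sorted((u for u in adj[v] if u not in visited),
                          key=lambda u: len(adj[u]))
        for u in children:
            visited.add(u)
            order.append(u)
            queue.append(u)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order
```

The per-node step (discover children, sort by degree, append) is exactly the unit of work the paper assigns speculatively to a CPU thread or GPU thread-block per batch, with re-evaluation ensuring the emitted permutation matches this serial result.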

