IEEE International Parallel and Distributed Processing Symposium

Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures



Abstract

Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth-reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU is a viable option. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multi-core CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.

Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to that of the ground-truth single-threaded algorithm.

We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA's single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches.
Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.
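For reference, the serial baseline that the paper parallelizes is the classic Cuthill-McKee traversal: a breadth-first visit that orders each node's undiscovered neighbors by ascending degree, with the final ordering reversed. The sketch below is a minimal illustration of that ground-truth serial algorithm (not the paper's speculative batched scheme); it assumes a connected undirected graph given as an adjacency-list dict, and uses a simple minimum-degree start heuristic in place of more elaborate pseudo-peripheral starting-node searches.

```python
from collections import deque

def rcm_permutation(adj, start=None):
    """Serial Reverse Cuthill-McKee on a connected undirected graph
    given as an adjacency list {node: [neighbors]}.
    Returns the nodes in RCM order (illustrative sketch)."""
    # Heuristic start: a minimum-degree node. (Production implementations
    # typically search for a pseudo-peripheral node instead.)
    if start is None:
        start = min(adj, key=lambda v: len(adj[v]))
    visited = {start}
    order = [start]
    queue = deque([start])
    while queue:
        v = queue.popleft()
        # Discover unvisited children and sort them by ascending degree,
        # as the Cuthill-McKee ordering prescribes.
        children = sorted((u for u in adj[v] if u not in visited),
                          key=lambda u: len(adj[u]))
        for u in children:
            visited.add(u)
            order.append(u)
            queue.append(u)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order
```

The per-node step (discover children, sort by degree, append) is exactly the unit of work the paper assigns speculatively to a CPU thread or GPU thread-block per batch, with re-evaluation ensuring the emitted permutation matches this serial result.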

