Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Abstract

The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm that distributes the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the DGEQP3 routine of Intel MKL (version 10.3) on a 12-core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general-purpose GPU processors, where our implementation obtains up to 90 GFlops on an NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).

Topics: Parallel and Distributed Computing.
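
As a rough illustration of the round-robin (column cyclic) layout mentioned in the abstract, the pure-Python sketch below runs Householder QR with column pivoting while assigning column j to worker j mod p. It is not the authors' implementation: the names `local_columns` and `qrp_cyclic` and the worker count `p` are hypothetical, the workers are simulated by sequential loops, and the column norms are recomputed from scratch at each step instead of being updated.

```python
import math

def local_columns(n, p, w, start=0):
    """Columns j >= start with j % p == w (hypothetical helper)."""
    first = start + ((w - start) % p)
    return list(range(first, n, p))

def qrp_cyclic(cols, p):
    """Householder QR with column pivoting on a list of columns.

    cols: list of n columns, each a list of m floats (not modified).
    p:    number of simulated workers; worker w owns columns j with j % p == w.
    Returns (perm, R): column k of R comes from original column perm[k].
    """
    A = [list(c) for c in cols]
    n, m = len(A), len(A[0])
    perm = list(range(n))
    norms2 = [sum(x * x for x in c) for c in A]  # squared column norms

    for k in range(min(m, n)):
        # Pivot search: each worker proposes its best trailing column,
        # then the proposals are reduced to a single global pivot.
        proposals = []
        for w in range(p):
            mine = local_columns(n, p, w, start=k)
            if mine:
                proposals.append(max(mine, key=lambda j: norms2[j]))
        piv = max(proposals, key=lambda j: norms2[j])
        A[k], A[piv] = A[piv], A[k]
        perm[k], perm[piv] = perm[piv], perm[k]
        norms2[k], norms2[piv] = norms2[piv], norms2[k]

        # Householder reflector that zeroes A[k][k+1:].
        x = A[k][k:]
        alpha = math.sqrt(sum(t * t for t in x))
        if alpha == 0.0:
            break  # remaining columns are exactly zero below row k
        if x[0] > 0:
            alpha = -alpha  # sign chosen to avoid cancellation in v[0]
        v = x[:]
        v[0] -= alpha
        vnorm2 = sum(t * t for t in v)
        A[k][k:] = [alpha] + [0.0] * (m - k - 1)

        # Trailing update: every worker touches only the columns it owns.
        for w in range(p):
            for j in local_columns(n, p, w, start=k + 1):
                col = A[j]
                s = 2.0 * sum(v[i] * col[k + i] for i in range(m - k)) / vnorm2
                for i in range(m - k):
                    col[k + i] -= s * v[i]
                # Assumption: recompute the trailing norm for simplicity.
                norms2[j] = sum(t * t for t in col[k + 1:])
    return perm, A

if __name__ == "__main__":
    cols = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 0.0, -1.0, 2.0]]
    perm, R = qrp_cyclic(cols, p=2)
    print("permutation:", perm)
    print("|diag(R)|:", [abs(R[j][j]) for j in range(len(R))])
```

Because ownership depends only on j mod p, each simulated worker keeps a similar number of trailing columns as k advances, which is the load-balancing property a cyclic layout is usually chosen for. In the example matrix the second column is twice the first, so the last diagonal entry of R comes out near zero, exposing the rank deficiency.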

Bibliographic details

Source: Tutorial on High Performance Numerical Tools for the Development and Scalability of High-End Computer Applications Conference, 2013, 9 pages
Authors: Andres Tomas; Zhaojun Bai; Vicente Hernandez
Original format: PDF
Chinese Library Classification: TP301-53

Similar literature

1. High Performance Parallelization of COMPSYN on a Cluster of Multicore Processors with GPUs [J]. Ferdinando Alessi, Annalisa Massini, Roberto Basili. Procedia Computer Science, 2012, Issue 1.
2. Two-Stage Least Squares Algorithms with QR Decomposition for Simultaneous Equations Models on Heterogeneous Multicore and Multi-GPU Systems [J]. Carla Ramiro, Jose J. López-Espín, Domingo Giménez. Procedia Computer Science, 2012, Issue 1.
3. Implementations of a Parallel Algorithm for Computing Euclidean Distance Map in Multicore Processors and GPUs [J]. Duhu Man, Kenji Uda, Hironobu Ueyama. International Journal of Networking and Computing, 2011, Issue 2.
4. Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors [C]. Andres Tomas, Zhaojun Bai, Vicente Hernandez. International Conference on High Performance Computing for Computational Science, 2013.
5. S-Orthogonal QR Decomposition Algorithms on Multicore Systems [D]. Zhao, Jia Qi. 2013.
6. Parallel Digital Watermarking Process on Ultrasound Medical Images in Multicores Environment [O]. Hui Liang Khor, Siau-Chuin Liew, Jasni Mohd. Zain. 2016.
7. Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors [O]. Andrés Tomás, Zhaojun Bai, Vicente Hernández. 2013.
8. Communication Avoiding Rank Revealing QR Factorization with Column Pivoting [R]. Demmel, J. W., Grigori, L., Gu, M. 2013.
