Conference: 2011 25th IEEE International Parallel & Distributed Processing Symposium

A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method

Abstract

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, "communication" includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ("normalization"). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5 to 20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
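
To make the idea behind TSQR concrete, the following is a minimal single-process sketch in Python with NumPy. It is not the Trilinos implementation described in the abstract. It shows only the basic one-level reduction: split the tall matrix into row blocks, factor each block independently, then factor the stacked small R factors once. The function names `tsqr` and `numerical_rank`, the `num_blocks` parameter, and the `rel_tol` threshold are illustrative assumptions, and estimating rank from the singular values of the small R factor is shown here as one plausible way to use the factorization for rank detection.

```python
import numpy as np


def tsqr(A, num_blocks=4):
    """One-level TSQR sketch: return (Q, R) with A = Q @ R.

    A is m-by-n with m >> n. Each row block stands in for the data
    owned by one processor (hypothetical layout for illustration).
    """
    m, n = A.shape
    blocks = np.array_split(A, num_blocks, axis=0)
    if any(B.shape[0] < n for B in blocks):
        raise ValueError("every row block needs at least n rows")

    # Step 1: independent local QR factorizations (no communication).
    local = [np.linalg.qr(B) for B in blocks]  # reduced QR: Q_i, R_i

    # Step 2: stack the small n-by-n R factors and factor them once.
    # In a parallel setting this is the only step that exchanges data.
    R_stack = np.vstack([R_i for _, R_i in local])
    Q2, R = np.linalg.qr(R_stack)

    # Step 3: combine each local Q_i with its n-by-n slice of Q2.
    Q = np.vstack([
        Q_i @ Q2[i * n:(i + 1) * n, :]
        for i, (Q_i, _) in enumerate(local)
    ])
    return Q, R


def numerical_rank(R, rel_tol=1e-10):
    """Estimate rank from the singular values of the small R factor.

    rel_tol is an assumed threshold, not a value from the paper.
    """
    s = np.linalg.svd(R, compute_uv=False)
    if s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 8))  # a few very long vectors
    Q, R = tsqr(A, num_blocks=4)
    print("reconstruction error:", np.linalg.norm(Q @ R - A))
    print("orthogonality error:", np.linalg.norm(Q.conj().T @ Q - np.eye(8)))
    print("estimated rank:", numerical_rank(R))
```

Because only the n-by-n R factors are combined across blocks, the amount of data exchanged depends on the number of columns rather than on the length of the vectors, which is the property the abstract refers to as communication avoidance.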
