Journal: Neural, Parallel & Scientific Computations

Memory Hierarchy Exploration For Accelerating The Parallel Computation Of Svds

Abstract

The performance of many applications on modern computers is often limited by memory latency rather than by processor speed. On computers with a memory hierarchy, it is preferable to perform computations on blocks of data, which reduces the impact of memory latency by reusing data already loaded into cache. This paper proposes a fast algorithm, based on one-sided Jacobi, for computing the highly useful singular value decomposition (SVD) in parallel on architectures with multi-level memory hierarchies. On P parallel processors, the given matrix is divided into super-rows, and these super-rows are then partitioned into 2P blocks. One key point of the proposed algorithm is its heavy exploitation of the memory hierarchy: all computations are performed on super-rows loaded in cache memory rather than on individual rows. A second key point is that the number of sweeps required for convergence is very close to that of cyclic one-sided Jacobi. A third key point is that the number of sweeps required for convergence does not depend strongly on the size of the input matrix. On two dual-core Intel Xeon processors, our results show that the performance of the parallel implementation of the proposed algorithm is around 11 times higher than that of the sequential implementation on the same hardware. Moreover, a performance of around 10 GFLOPS (double precision) can be achieved on the target system using multi-threading, Intel SIMD instructions, matrix blocking, and loop unrolling.
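
For readers unfamiliar with the baseline the abstract compares against, the following is a minimal sketch of plain cyclic one-sided Jacobi applied to matrix rows. It is not the paper's blocked, parallel algorithm: there are no super-rows, no 2P-block partitioning, and no SIMD or multi-threading. The function name `one_sided_jacobi_singular_values`, the tolerance `tol`, and the sweep limit `max_sweeps` are assumptions made for illustration only.

```python
import math


def one_sided_jacobi_singular_values(a, tol=1e-12, max_sweeps=60):
    """Cyclic one-sided Jacobi on rows (illustrative sketch, not the paper's code).

    `a` is a list of rows. Returns (singular values in descending order,
    number of sweeps performed).
    """
    rows = [[float(x) for x in r] for r in a]  # work on a copy
    m = len(rows)
    sweeps = 0
    for sweep in range(1, max_sweeps + 1):
        sweeps = sweep
        rotated = False
        # One sweep visits every row pair (i, j) once, in cyclic order.
        for i in range(m - 1):
            for j in range(i + 1, m):
                ri, rj = rows[i], rows[j]
                alpha = sum(x * x for x in ri)             # squared norm of row i
                beta = sum(y * y for y in rj)              # squared norm of row j
                gamma = sum(x * y for x, y in zip(ri, rj))  # inner product
                # Skip pairs that are already numerically orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Choose the plane rotation that makes the two rows orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                rows[i] = [c * x - s * y for x, y in zip(ri, rj)]
                rows[j] = [s * x + c * y for x, y in zip(ri, rj)]
        if not rotated:
            break  # a full sweep without rotations: rows are mutually orthogonal
    # After convergence the row norms are the singular values.
    sigma = sorted((math.sqrt(sum(x * x for x in r)) for r in rows), reverse=True)
    return sigma, sweeps


if __name__ == "__main__":
    sigma, sweeps = one_sided_jacobi_singular_values([[3.0, 0.0], [4.0, 5.0]])
    print(sigma, sweeps)
```

Each rotation reads and rewrites two full rows, so on a large matrix the inner loop is dominated by memory traffic. That is the cost the abstract says the proposed algorithm reduces by operating on cache-resident super-rows instead of individual rows.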
