An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors

Abstract

In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of the various matrix operations in BCG, we identify the construction of new search direction matrices as the main performance bottleneck. Replacing the QR decomposition with the eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enable the BCG implementation to handle linear systems with large, sparse coefficient matrices that cannot fit in GPU memory. The hybrid scheme offloads matrix operations to the GPU processors while helping to hide the CPU-GPU memory transaction overhead. We compare the performance of our BCG implementation with that of an implementation on a CPU with Intel Xeon Phi coprocessors using the automatic offload mode. With a sufficient number of right-hand sides, the CPU-GPU implementation of BCG can reach a speedup of 2.61 over the CPU-only implementation, which is significantly higher than that of the CPU-Intel Xeon Phi implementation.
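
The idea of replacing a QR decomposition with the eigendecomposition of a small matrix can be illustrated with a minimal NumPy sketch. The abstract does not say which small matrix is used, so this sketch assumes it is the s x s Gram matrix of an n x s block P (with s much smaller than n); the helper name `orthonormalize_via_eig` and the tolerance `tol` are hypothetical, and the code is not the authors' implementation.

```python
import numpy as np

def orthonormalize_via_eig(P, tol=1e-12):
    """Orthonormalize the columns of a tall, skinny block P (n x s).

    Hypothetical helper: instead of a QR factorization of P, it uses the
    eigendecomposition of the small s x s Gram matrix G = P^T P.
    """
    G = P.T @ P                    # s x s matrix, cheap to form when s << n
    w, V = np.linalg.eigh(G)       # G = V diag(w) V^T, eigenvalues ascending
    keep = w > tol * w.max()       # discard (near-)linearly dependent directions
    # Scale each retained eigenvector by 1/sqrt(eigenvalue) so that Q^T Q = I.
    return P @ (V[:, keep] / np.sqrt(w[keep]))

# Small demonstration on a random block with 8 columns (8 right-hand sides).
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 8))
Q = orthonormalize_via_eig(P)
print(np.allclose(Q.T @ Q, np.eye(Q.shape[1])))
```

In this formulation the only operations on the large n x s block are two matrix products, which tend to map well onto a GPU, while the eigendecomposition acts on an s x s matrix. The trade-off is that forming the Gram matrix squares the condition number of P, which is why the sketch drops very small eigenvalues.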
