Journal: Concurrency, practice and experience
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Abstract

For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication, and the BLAS-3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher-levels of software stacks, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques.
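To make the two kernels named in the abstract concrete, here is a minimal sketch of the textbook, unblocked Householder tridiagonalization in plain Python. It is not the authors' blocked multi-GPU scheme: it only shows where a symmetric matrix-vector product and a symmetric low-rank update appear in each reduction step. The helper names `symv`, `syr2` and `tridiagonalize` are illustrative assumptions that loosely mirror the BLAS routine names, and plain lists are used only to keep the example dependency-free.

```python
import math

def symv(A, v, lo):
    """Return A[lo:, lo:] @ v, the BLAS-2 style symmetric matrix-vector product."""
    n = len(A)
    return [sum(A[i][j] * v[j - lo] for j in range(lo, n)) for i in range(lo, n)]

def syr2(A, v, w, lo):
    """In place: A[lo:, lo:] -= v w^T + w v^T (symmetric rank-2 update)."""
    n = len(A)
    for i in range(lo, n):
        for j in range(lo, n):
            A[i][j] -= v[i - lo] * w[j - lo] + w[i - lo] * v[j - lo]

def tridiagonalize(A):
    """Reduce symmetric A to tridiagonal T = Q^T A Q; return (diagonal, off-diagonal)."""
    n = len(A)
    A = [row[:] for row in A]  # work on a copy, leave the input untouched
    for k in range(n - 2):
        lo = k + 1
        x = [A[i][k] for i in range(lo, n)]  # column k below the diagonal
        if all(t == 0.0 for t in x[1:]):
            continue  # this column is already in tridiagonal form
        norm = math.sqrt(sum(t * t for t in x))
        alpha = -math.copysign(norm, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha  # Householder vector: (I - beta v v^T) x = alpha e_1
        beta = 2.0 / sum(t * t for t in v)
        p = [beta * t for t in symv(A, v, lo)]  # the matrix-vector kernel
        gamma = 0.5 * beta * sum(pi * vi for pi, vi in zip(p, v))
        w = [pi - gamma * vi for pi, vi in zip(p, v)]
        syr2(A, v, w, lo)  # the low-rank update kernel
        A[lo][k] = A[k][lo] = alpha
        for i in range(lo + 1, n):
            A[i][k] = A[k][i] = 0.0
    d = [A[i][i] for i in range(n)]
    e = [A[i + 1][i] for i in range(n - 1)]
    return d, e
```

Because the reduction is an orthogonal similarity transformation, the trace and the Frobenius norm of the input are preserved, which gives a cheap sanity check on the returned `(d, e)`. Blocked implementations typically accumulate several Householder steps and apply them together as a rank-2k update, which is where a BLAS-3 kernel replaces the rank-2 update sketched here.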
