Journal: Cluster Computing

Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Abstract

Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Executing tensor contractions efficiently on Graphics Processing Units (GPUs) requires addressing several challenges, including index permutation and small dimension sizes that reduce thread block utilization. Moreover, to apply the same optimizations to various expressions, we need a code generation tool. In this paper, we present our approach to automatically generating CUDA code that executes tensor contractions on GPUs, including management of data movement between CPU and GPU. To evaluate our tool, GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled cluster method, and incorporated into NWChem, a popular computational chemistry suite. For this method, we demonstrate a speedup of over a factor of 8.4 using one GPU compared to one CPU core, and of over 2.6 when utilizing the entire system with a hybrid CPU+GPU solution using 2 GPUs and 5 cores (instead of 7 cores per node). We further investigate tensor contraction code on a new series of GPUs, the Fermi GPUs, and provide several effective optimization algorithms. For the same CCSD(T) computation, on a cluster with Fermi GPUs, we achieve a speedup of 3.4 over a cluster with T10 GPUs. With a single Fermi GPU on each node, we achieve a speedup of 43 over the sequential CPU version.
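To make the phrase "generalized multidimensional matrix multiplication" concrete, here is a minimal pure-Python sketch. It is not the paper's generated CUDA code. The contraction C[a][b][c] = sum over k of A[a][k][b] * B[k][c], the tensor shapes, and the function names `contract_direct` and `contract_via_matmul` are all assumptions chosen for illustration. The second function shows why index permutation matters: once the indices of A are reordered so that the contracted index k comes last, the contraction reduces to one ordinary matrix multiply.

```python
def contract_direct(A, B):
    """C[a][b][c] = sum_k A[a][k][b] * B[k][c], written as explicit loops."""
    na, nk, nb, nc = len(A), len(B), len(A[0][0]), len(B[0])
    return [[[sum(A[a][k][b] * B[k][c] for k in range(nk))
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]


def contract_via_matmul(A, B):
    """Same contraction, done as an index permutation plus one matrix multiply."""
    na, nk, nb, nc = len(A), len(B), len(A[0][0]), len(B[0])
    # Index permutation: A[a][k][b] becomes a matrix whose rows are the
    # fused index (a, b) and whose columns are the contracted index k.
    A_mat = [[A[a][k][b] for k in range(nk)]
             for a in range(na) for b in range(nb)]
    # Ordinary matrix multiply: (na*nb x nk) times (nk x nc).
    C_mat = [[sum(row[k] * B[k][c] for k in range(nk)) for c in range(nc)]
             for row in A_mat]
    # Split the fused (a, b) row index back into two dimensions.
    return [C_mat[a * nb:(a + 1) * nb] for a in range(na)]


# Small illustrative tensors (hypothetical values): A is 2 x 3 x 2, B is 3 x 2.
A = [[[a * 100 + k * 10 + b for b in range(2)] for k in range(3)]
     for a in range(2)]
B = [[k + c + 1 for c in range(2)] for k in range(3)]

C = contract_via_matmul(A, B)
```

Both functions return the same nested list. On a GPU the permutation step is a data-layout transformation with its own memory-access cost, and small extents such as the 2 and 3 used here are the kind of dimension sizes that would leave a thread block underutilized.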