
Optimizing a Multiple Right-Hand Side Dslash Kernel for Intel Knights Corner

Abstract

There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can require trillions of operations. LQCD computations solve a sparse linear system using the Wilson Dslash kernel, which has an arithmetic intensity of 0.88-2.29. This makes Dslash memory bandwidth-bound on most architectures, including the Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. A class of LQCD computations instead solves systems with multiple right-hand sides (MRHS), which presents additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach, and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
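The claim that an arithmetic intensity of 0.88-2.29 makes Dslash bandwidth-bound can be illustrated with a simple roofline estimate. The sketch below is not taken from the paper. It assumes the intensity is measured in flops per byte, and the function names and the peak and bandwidth figures (`PEAK_GFLOPS`, `BANDWIDTH_GBS`) are hypothetical placeholders chosen only for illustration, not measured KNC values.

```python
# Minimal roofline sketch. All hardware numbers are illustrative
# placeholders (assumptions), not figures reported in the abstract.

PEAK_GFLOPS = 2000.0    # hypothetical single-precision peak, GFLOP/s
BANDWIDTH_GBS = 150.0   # hypothetical sustained memory bandwidth, GB/s


def attainable_gflops(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Roofline bound: the lower of the compute peak and intensity * bandwidth."""
    return min(peak, intensity * bandwidth)


def is_bandwidth_bound(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """A kernel is bandwidth-bound when its intensity is below the ridge point."""
    ridge = peak / bandwidth  # flops per byte needed to saturate compute
    return intensity < ridge


if __name__ == "__main__":
    # The two ends of the intensity range quoted in the abstract.
    for ai in (0.88, 2.29):
        bound = attainable_gflops(ai)
        kind = "bandwidth-bound" if is_bandwidth_bound(ai) else "compute-bound"
        print(f"intensity {ai:.2f} flop/byte: at most {bound:.1f} GFLOP/s ({kind})")
```

Under these placeholder numbers both ends of the quoted range fall well below the ridge point, so attainable performance scales with memory bandwidth rather than compute peak. Reusing loaded data across several right-hand sides, as the MRHS approaches aim to do, raises the effective intensity and therefore the bound.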
