Conference: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that requires solving the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, performance drops to 50 Gflops. Our performance is 2–3X higher than that of a well-known implementation from the Chroma software suite running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce the external memory bandwidth requirements of LQCD, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32³ × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver using the Wilson-Dslash operator achieves 2.95 Tflops.
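
To put the cluster-level figures in perspective, the short Python sketch below works out per-node and per-core quantities from the numbers quoted in the abstract, reading the lattice as 32³ × 256 sites. It is only illustrative arithmetic: the function name `scaling_summary` and the constant names are hypothetical, and "over 4 Tflops" is treated as a lower bound of exactly 4.0.

```python
# Figures quoted in the abstract; everything derived from them is plain arithmetic.
LATTICE_DIMS = (32, 32, 32, 256)  # lattice of 32^3 x 256 sites
NODES = 128                       # size of the strong-scaled system
CORES = 1536                      # total cores across all nodes
DSLASH_TFLOPS = 4.0               # "over 4 Tflops": used here as a lower bound (assumption)
CG_TFLOPS = 2.95                  # Conjugate Gradients figure from the abstract


def scaling_summary():
    """Return per-node and per-core quantities implied by the quoted figures."""
    sites = 1
    for extent in LATTICE_DIMS:
        sites *= extent
    return {
        "total_sites": sites,
        "cores_per_node": CORES // NODES,
        "sites_per_node": sites // NODES,
        "sites_per_core": sites / CORES,
        "dslash_gflops_per_node": DSLASH_TFLOPS * 1000 / NODES,
        "cg_gflops_per_node": CG_TFLOPS * 1000 / NODES,
    }


if __name__ == "__main__":
    for name, value in scaling_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions the lattice holds 8,388,608 sites, so each of the 128 nodes (12 cores apiece) works on 65,536 sites. The quoted throughput corresponds to at least 31.25 Gflops per node for the Dslash operator and roughly 23 Gflops per node for Conjugate Gradients.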