High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach

Abstract

Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that requires solving the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers 75 Gflops (single-precision) on an Intel® Xeon® Processor X5680, achieving 60% computational efficiency for datasets that fit in the last-level cache. For datasets larger than the last-level cache, this performance drops to 50 Gflops. Our performance is 2–3X higher than a well-known implementation from the Chroma software suite when running on the same hardware platform. The novel implementation of LQCD reported in this paper is based on the recently published 3.5D spatial and 4.5D temporal tiling schemes. Both blocking schemes significantly reduce LQCD external memory bandwidth requirements, delivering a more compute-bound implementation. The performance advantage of our schemes will become more significant as the gap between compute flops and external memory bandwidth continues to grow. We demonstrate very good cluster-level scalability of our implementation: for a lattice of 32³ × 256 sites, we achieve over 4 Tflops when strong-scaled to a 128-node system (1536 cores total). For the same lattice size, a full Conjugate Gradients solver built on the Wilson-Dslash operator achieves 2.95 Tflops.
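The cluster-level figures in the abstract can be reduced to per-node numbers with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, the even split of lattice sites across nodes is an assumption (the abstract does not describe the decomposition), and "over 4 Tflops" is treated as a lower bound of exactly 4.0.

```python
# Figures quoted in the abstract; all names here are illustrative.
LATTICE_DIMS = (32, 32, 32, 256)  # 32^3 x 256 sites
NODES = 128
TOTAL_CORES = 1536
DSLASH_TFLOPS = 4.0               # "over 4 Tflops", used as a lower bound
CG_TFLOPS = 2.95                  # full Conjugate Gradients solver


def per_node_summary():
    """Derive per-node quantities from the cluster-level figures."""
    sites = 1
    for extent in LATTICE_DIMS:
        sites *= extent
    return {
        "total_sites": sites,
        # Assumption: sites are divided evenly across nodes.
        "sites_per_node": sites // NODES,
        "cores_per_node": TOTAL_CORES // NODES,
        # 1 Tflops = 1000 Gflops.
        "dslash_gflops_per_node": DSLASH_TFLOPS * 1000 / NODES,
        "cg_gflops_per_node": CG_TFLOPS * 1000 / NODES,
        # Fraction of the Dslash rate that the full solver sustains.
        "cg_to_dslash_ratio": CG_TFLOPS / DSLASH_TFLOPS,
    }


if __name__ == "__main__":
    for name, value in per_node_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions, each of the 128 nodes has 12 cores and holds 65,536 lattice sites, and the strong-scaled Dslash rate works out to at least 31.25 Gflops per node.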

Bibliographic record

Source
2011 International Conference for High Performance Computing, Networking, Storage and Analysis | 2011 | pp. 1-10 | 10 pages
Conference location
Authors
Smelyanskiy Mikhail; Vaidyanathan Karthikeyan; Choi Jee; Joo Balint; Chhugani Jatin; Clark Michael A.; Dubey Pradeep
Author affiliations
Conference organizer
Original format: PDF
Language of text
Chinese Library Classification: Computer networks
Keywords

Similar literature

1. An iteration-based hybrid parallel algorithm for tridiagonal systems of equations on multi-core architectures [J]. Guangping Tang, Wangdong Yang, Kenli Li. Concurrency, practice and experience. 2015, Issue 17
2. A parallel way of data decomposition approach for ANN based image reconstruction in e-MRI on a multi-core computer system [J]. Subramanian Kartheeswaran, Daniel Dharmaraj Christopher Durairaj. Informatics in Medicine Unlocked. 2017, Issue 1
3. A data-parallelism approach for PSO-ANN based medical image reconstruction on a multi-core system [J]. Subramanian Kartheeswaran, Daniel Dharmaraj Christopher Durairaj. Informatics in Medicine Unlocked. 2017, Issue 1
4. High-Performance Lattice QCD for Multi-core Based Parallel Systems Using a Cache-Friendly Hybrid Threaded-MPI Approach [C]. Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Jee Choi. International Conference for High Performance Computing, Networking, Storage and Analysis. 2011
5. Parallel subgraph mining on hybrid platforms: HPC systems, multi-cores and GPUs [D]. Talukder, Nilothpal. 2016
6. A Multi-Core Parallelization Strategy for Statistical Significance Testing in Learning Classifier Systems [O]. James Rudd, Jason H. Moore, Ryan J. Urbanowicz
7. A data-parallelism approach for PSO-ANN based medical image reconstruction on a multi-core system [O]. Subramanian Kartheeswaran, Daniel Dharmaraj Christopher Durairaj. 2017
