Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters

Abstract

Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real-world applications are significantly affected by network characteristics, so to maximize their performance on these clusters it is particularly important to saturate network bandwidth effectively and/or hide communication latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for avoiding serialization and maximizing parallelism during application communication phases, together with its impact. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well-known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement over our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first ever reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.
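
The headline figures in the abstract can be put on a per-coprocessor footing with a little arithmetic. The sketch below is not from the paper: it assumes the usual HPL convention that efficiency is achieved performance divided by theoretical peak, and the function and variable names are illustrative only.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumption (not stated in the abstract): efficiency = achieved / peak,
# the usual HPL convention. All names here are illustrative.

def implied_peak(achieved_tflops: float, efficiency: float) -> float:
    """Theoretical peak implied by an achieved rate and an efficiency."""
    return achieved_tflops / efficiency

def per_device(total_tflops: float, devices: int) -> float:
    """Average rate per coprocessor."""
    return total_tflops / devices

# HPL/LINPACK: 97 TFLOPs at 68.5% efficiency on 128 coprocessors.
hpl_peak = implied_peak(97.0, 0.685)
hpl_peak_per_card = per_device(hpl_peak, 128)
hpl_achieved_per_card = per_device(97.0, 128)

# FFT: 10.8 TFLOPs on 1024 coprocessors, expressed in GFLOPs per card.
fft_gflops_per_card = per_device(10.8, 1024) * 1000.0

print(f"HPL implied peak:       {hpl_peak:.1f} TFLOPs")
print(f"HPL peak per card:      {hpl_peak_per_card:.2f} TFLOPs")
print(f"HPL achieved per card:  {hpl_achieved_per_card:.2f} TFLOPs")
print(f"FFT achieved per card:  {fft_gflops_per_card:.1f} GFLOPs")
```

Under that assumption the HPL run implies a peak of roughly 141.6 TFLOPs, or about 1.11 TFLOPs per coprocessor, with about 0.76 TFLOPs per coprocessor achieved. The FFT result works out to roughly 10.5 GFLOPs per coprocessor, which is consistent with the abstract's emphasis on network bandwidth and latency for communication-heavy kernels.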

Bibliographic details

Source
IEEE International Parallel and Distributed Processing Symposium | 2014 | pp. 1083-1092 | 10 pages
Authors
Vaidyanathan Karthikeyan; Pamnany Kiran; Kalamkar Dhiraj D.; Heinecke Alexander; Smelyanskiy Mikhail; Park Jongsoo; Kim Daehyun; Shet Aniruddha G.; Kaul Bharat; Joo Balint; Dubey Pradeep
Original format: PDF
Keywords
FFT; HPL; Intel Xeon Phi coprocessor clusters; Lattice QCD; native applications;

Similar literature

1. High-level Support for Hybrid Parallel Execution of C++ Applications Targeting Intel® Xeon Phi™ Coprocessors [J] . Jiri Dokulil, Enes Bajrovic, Siegfried Benkner, Procedia Computer Science . 2013, No. 1

2. Parallel Mutual Information Based Construction of Genome-Scale Networks on the Intel® Xeon Phi™ Coprocessor [J] . Misra Sanchit, Pamnany Kiran, Aluru Srinivas, Computational Biology and Bioinformatics, IEEE/ACM Transactions on . 2015, No. 5

3. Asynchronous and synchronous models of executions on Intel® Xeon Phi™ coprocessor systems for high performance of long wave radiation calculations in atmosphere models [J] . Amlesh Kashyap, Sathish S. Vadhiyar, Ravi S. Nanjundiah, Journal of Parallel and Distributed Computing . 2017, Apr.

4. Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters [C] . Vaidyanathan Karthikeyan, Pamnany Kiran, Kalamkar Dhiraj D., IEEE International Parallel Distributed Processing Symposium . 2014

5. Advancing LAMMPS Performance on Intel Xeon Phi Processors Coprocessors [D] . Vorsu, Sandeep Kumar. 2017

6. Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU: A Case Study from Microscopy Image Analysis [O] . George Teodoro, Tahsin Kurc, Jun Kong

7. High-level Support for Hybrid Parallel Execution of C++ Applications Targeting Intel® Xeon Phi™ Coprocessors [O] . Dokulil Jiri, Bajrovic Enes, Benkner Siegfried, 2013
