Journal: Experimental Mechanics

Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors

Abstract

In this work, we take up the challenge of performance-portable programming of heterogeneous stencil computations across a wide range of modern shared-memory systems. An important example of such computations is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), the second major part of the dynamic core of the EULAG geophysical model. To this end, we develop a set of parametric optimization techniques and a four-step procedure for customizing the MPDATA code. These techniques include the islands-of-cores strategy, (3+1)D decomposition, exploitation of data parallelism and simultaneous multithreading, data-flow synchronization, and vectorization. The proposed adaptation methodology helps us develop an automatic transformation of the MPDATA code that achieves high, scalable sustained performance on all tested ccNUMA platforms with recent generations of Intel processors. Scalable here means that, for a given platform, the sustained performance of the new code stays at a similar level regardless of the problem size. The highest utilization rate, about 41-46% of the theoretical peak across all benchmarks, is achieved on each of the two-socket servers based on the Skylake-SP (SKL-SP), Broadwell, and Haswell CPU architectures. At the same time, the four-socket server with SKL-SP processors achieves the highest sustained performance, around 1.0-1.1 Tflop/s, which corresponds to about 33% of the peak.
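To make the relation between sustained performance and utilization rate concrete, the following minimal sketch back-calculates the theoretical peak implied by the four-socket figures quoted above. It assumes that the utilization rate is simply sustained performance divided by theoretical peak; the function names `utilization` and `implied_peak` are hypothetical and do not come from the paper.

```python
def utilization(sustained_tflops: float, peak_tflops: float) -> float:
    """Fraction of the theoretical peak that is actually sustained.

    Assumption: utilization = sustained / peak (not stated explicitly in the abstract).
    """
    if peak_tflops <= 0:
        raise ValueError("peak performance must be positive")
    return sustained_tflops / peak_tflops


def implied_peak(sustained_tflops: float, utilization_rate: float) -> float:
    """Theoretical peak implied by a sustained figure and a utilization rate."""
    if not 0 < utilization_rate <= 1:
        raise ValueError("utilization rate must be in (0, 1]")
    return sustained_tflops / utilization_rate


# Figures quoted in the abstract for the four-socket SKL-SP server:
# 1.0-1.1 Tflop/s sustained at about 33% of the peak.
low = implied_peak(1.0, 0.33)
high = implied_peak(1.1, 0.33)
print(f"implied peak: {low:.2f}-{high:.2f} Tflop/s")
```

Under this assumption, the quoted figures imply a theoretical peak of roughly 3.0-3.3 Tflop/s for the four-socket server. The abstract's "about 33%" is rounded, so this range is only approximate.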
