Parallel Computing

Unstructured computational aerodynamics on many integrated core architecture

Abstract

Shared memory parallelization of the flux kernel of PETSc-FUN3D, an unstructured tetrahedral mesh Euler flow code previously studied for distributed memory and multi-core shared memory, is evaluated on up to 61 cores per node and up to 4 threads per core. We explore several thread-level optimizations to improve flux kernel performance on the state-of-the-art many integrated core (MIC) Intel processor Xeon Phi "Knights Corner," with a focus on strong thread scaling. While the linear algebraic kernel is bottlenecked by memory bandwidth for even modest numbers of cores sharing a common memory, the flux kernel, which arises in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, is compute-intensive and is known to exploit contemporary multi-core hardware effectively. We extend the study of the performance of the flux kernel to the Xeon Phi in three thread affinity modes, namely scatter, compact, and balanced, in both offload and native mode, with and without various code optimizations to improve alignment and reduce cache coherency penalties. Relative to baseline "out-of-the-box" optimized compilation, code restructuring optimizations provide about 3.8x speedup using the offload mode and about 5x speedup using the native mode. Even with these gains for the flux kernel, with respect to execution time the MIC simply achieves par with optimized compilation on a contemporary multi-core Intel CPU, the 16-core Sandy Bridge E5 2670. Nevertheless, the optimizations employed to reduce the data motion and cache coherency protocol penalties of the MIC are expected to be of value for CFD and many other unstructured applications as many-core architecture evolves.
We explore large-scale distributed-shared memory performance on the Cray XC40 supercomputer, to demonstrate that optimizations employed on Phi hybridize to this context, where each of thousands of nodes comprises two sockets of Intel Xeon Haswell CPUs with 32 cores per node. (C) 2016 Elsevier B.V. All rights reserved.
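
The three thread affinity modes named in the abstract (scatter, compact, and balanced) differ in how software threads are laid out over cores and their hardware thread contexts. The sketch below is a simplified model of those placement policies for illustration only, not the paper's code and not the actual logic of any OpenMP runtime. The function names `place_threads` and `cores_used` are hypothetical, and the defaults of 61 cores with 4 threads per core are taken from the abstract. A real runtime also depends on details such as core numbering and granularity settings, which are ignored here.

```python
def place_threads(n_threads, n_cores=61, threads_per_core=4, mode="balanced"):
    """Return a list of (core, slot) pairs, indexed by thread id.

    Simplified model (an assumption, not a runtime's real algorithm):
      compact  - fill every slot of a core before moving to the next core
      scatter  - deal threads round-robin across cores
      balanced - spread threads evenly over cores, but keep consecutive
                 thread ids together on the same core
    """
    if n_threads > n_cores * threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if mode == "compact":
        # divmod gives (t // threads_per_core, t % threads_per_core)
        return [divmod(t, threads_per_core) for t in range(n_threads)]
    if mode == "scatter":
        return [(t % n_cores, t // n_cores) for t in range(n_threads)]
    if mode == "balanced":
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            # The first `extra` cores take one additional thread.
            count = base + (1 if core < extra else 0)
            placement.extend((core, slot) for slot in range(count))
        return placement
    raise ValueError(f"unknown mode: {mode}")


def cores_used(placement):
    """Number of distinct cores that received at least one thread."""
    return len({core for core, _ in placement})


if __name__ == "__main__":
    for mode in ("compact", "scatter", "balanced"):
        p = place_threads(122, mode=mode)
        print(mode, "cores used:", cores_used(p), "threads 0,1 ->", p[0], p[1])
```

In this model, 122 threads in compact mode occupy only 31 of the 61 cores, while scatter and balanced both occupy all 61. Scatter and balanced differ in which thread ids share a core: balanced keeps neighbouring ids together, which matters when adjacent threads touch adjacent data.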