Parallel Computing

MPI vs Fortran coarrays beyond 100k cores: 3D cellular automata


Abstract

Fortran coarrays are an attractive alternative to MPI due to the familiar Fortran syntax, single-sided communications and implementation in the compiler. In this work the scaling of coarrays is compared to that of MPI, using cellular automata (CA) 3D Ising magnetisation miniapps built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity of 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1 B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pairwise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in a worse L2 cache hit ratio and lower performance in all cases, even though NUMA effects were ruled out. This is likely because the CA algorithm is network bound at scale, as further evidenced by the fact that very aggressive cache and inter-procedural optimisations led to no performance gain. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to the parallel libraries (MPICH2 vs libpgas) and the Cray hardware-specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores; however, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI. (C) 2019 Elsevier B.V. All rights reserved.
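The abstract contrasts two building blocks of the miniapps: pairwise halo exchange (MPI IRECV/ISEND/WAITALL vs Fortran sync images) and global reductions (MPI_ALLREDUCE vs the Fortran 2018 co_sum collective). The sketch below is a minimal, hypothetical coarray illustration of that pattern, not the CASUP/cgpack code: the local lattice size, the periodic 1D image decomposition, and the requirement of at least three images are illustrative assumptions only.

! Hypothetical minimal sketch (not the CASUP/cgpack source): a block-decomposed
! 3D spin lattice held in a coarray, a pairwise halo exchange along one
! decomposed direction guarded by sync images, and a co_sum reduction of the
! global magnetisation. Assumes num_images() >= 3 so that left, right and the
! executing image are all distinct.
program ca_coarray_sketch
  implicit none
  integer, parameter :: n = 32                     ! local cube edge (assumed)
  integer, allocatable :: spin(:,:,:)[:]           ! spins plus a 1-cell halo
  integer :: me, np, left, right, m_total

  me = this_image()
  np = num_images()
  allocate(spin(0:n+1, 0:n+1, 0:n+1)[*])

  ! initialise interior spins to +1 (placeholder for a random Ising state)
  spin(1:n, 1:n, 1:n) = 1

  ! neighbours on a periodic 1D chain of images (illustrative decomposition;
  ! the real miniapps decompose the lattice in 3D)
  left  = merge(np, me - 1, me == 1)
  right = merge(1,  me + 1, me == np)

  ! pairwise halo exchange: one-sided gets of the neighbours' boundary planes,
  ! with sync images playing the role that IRECV/ISEND/WAITALL plays in the
  ! MPI halo-exchange variant
  sync images([left, right])
  spin(0,   1:n, 1:n) = spin(n, 1:n, 1:n)[left]
  spin(n+1, 1:n, 1:n) = spin(1, 1:n, 1:n)[right]
  sync images([left, right])

  ! global magnetisation via the Fortran 2018 collective co_sum
  ! (the MPI miniapp would call MPI_ALLREDUCE with MPI_SUM here)
  m_total = sum(spin(1:n, 1:n, 1:n))
  call co_sum(m_total)

  if (me == 1) print '(a, i0)', 'total magnetisation: ', m_total
end program ca_coarray_sketch

In the MPI variant of the same pattern, the two coindexed gets and the surrounding sync images would be replaced by non-blocking IRECV/ISEND pairs completed with WAITALL, and the co_sum call by MPI_ALLREDUCE.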
