Parallel Computing

MPI vs Fortran coarrays beyond 100k cores: 3D cellular automata


Abstract

Fortran coarrays are an attractive alternative to MPI due to their familiar Fortran syntax, single-sided communications, and implementation in the compiler. In this work the scaling of coarrays is compared to that of MPI, using 3D cellular automata (CA) Ising magnetisation miniapps built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with the MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity of 109,056 cores. Ping-pong latency and bandwidth results are very similar for MPI and for coarrays over message sizes from 1 B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to either MPI or coarrays resulted in a worse L2 cache hit ratio and lower performance in all cases, even though NUMA effects were ruled out. This is likely because the CA algorithm is network bound at scale, as further evidenced by the fact that very aggressive cache and inter-procedural optimisations led to no performance gain. Sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to the parallel libraries (MPICH2 vs libpgas) and the Cray hardware-specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores; however, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI. (C) 2019 Elsevier B.V. All rights reserved.
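
The reduction mentioned in the abstract is easy to illustrate side by side. Below is a minimal sketch of the two collectives it names, assuming a simple per-image/per-rank partial sum; the program and variable names (co_sum_demo, mag) are illustrative and not taken from CASUP. The coarray form uses the Fortran 2018 co_sum intrinsic collective:

    ! Minimal sketch: each image holds a partial sum (e.g. its local
    ! Ising magnetisation) and co_sum turns it into the global total.
    ! Needs a coarray-capable compiler, e.g. Cray ftn or OpenCoarrays caf.
    program co_sum_demo
      implicit none
      real :: mag

      mag = real(this_image())   ! stand-in for the per-image partial sum

      call co_sum(mag)           ! Fortran 2018 collective: in-place all-reduce

      if (this_image() == 1) print *, 'global sum =', mag
    end program co_sum_demo

The MPI counterpart performs the same all-reduce explicitly (mpi_f08 bindings shown):

    ! The equivalent MPI reduction with MPI_ALLREDUCE.
    program allreduce_demo
      use mpi_f08
      implicit none
      integer :: rank
      real :: mag_local, mag_total

      call MPI_Init()
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)

      mag_local = real(rank + 1)   ! stand-in for the per-rank partial sum

      call MPI_Allreduce(mag_local, mag_total, 1, MPI_REAL, MPI_SUM, &
                         MPI_COMM_WORLD)

      if (rank == 0) print *, 'global sum =', mag_total
      call MPI_Finalize()
    end program allreduce_demo

In both cases every image or rank ends up with the same global value, which is what the miniapps need for the Ising energy and magnetisation.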
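
The pair-wise halo exchange (HX) contrast can be sketched the same way. The following programs reduce it to a 1D chain of images/ranks with one halo cell per side; the 3D miniapps exchange faces with up to six neighbours, and the array name a and the decomposition here are assumptions made for brevity, not the CASUP kernels. The coarray version pairs one-sided gets with sync images:

    ! Minimal 1D sketch of coarray halo exchange with sync images.
    program coarray_hx
      implicit none
      integer, parameter :: n = 8
      real :: a(0:n+1)[*]    ! interior cells 1..n plus one halo cell per side
      integer :: me, np

      me = this_image()
      np = num_images()
      a(1:n) = real(me)      ! fill the interior

      sync all               ! all interiors written before any halo read

      ! one-sided gets from the neighbours' interiors into the local halos
      if (me > 1)  a(0)   = a(n)[me-1]
      if (me < np) a(n+1) = a(1)[me+1]

      ! pair-wise synchronisation with only the images actually involved;
      ! in a time-stepping loop this separates the halo reads from the
      ! next update of the interior
      if (me > 1 .and. me < np) then
        sync images ([me-1, me+1])
      else if (me > 1) then
        sync images (me-1)
      else if (me < np) then
        sync images (me+1)
      end if
    end program coarray_hx

The MPI version posts the same pair-wise transfers as non-blocking calls and completes them with MPI_WAITALL:

    ! Minimal 1D sketch of MPI halo exchange with Irecv/Isend/Waitall.
    program mpi_hx
      use mpi_f08
      implicit none
      integer, parameter :: n = 8
      real :: a(0:n+1)
      integer :: rank, np, nreq
      type(MPI_Request) :: req(4)

      call MPI_Init()
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)
      call MPI_Comm_size(MPI_COMM_WORLD, np)

      a(1:n) = real(rank)
      nreq = 0

      if (rank > 0) then     ! exchange with the left neighbour
        nreq = nreq + 1
        call MPI_Irecv(a(0), 1, MPI_REAL, rank-1, 0, MPI_COMM_WORLD, req(nreq))
        nreq = nreq + 1
        call MPI_Isend(a(1), 1, MPI_REAL, rank-1, 1, MPI_COMM_WORLD, req(nreq))
      end if
      if (rank < np-1) then  ! exchange with the right neighbour
        nreq = nreq + 1
        call MPI_Irecv(a(n+1), 1, MPI_REAL, rank+1, 1, MPI_COMM_WORLD, req(nreq))
        nreq = nreq + 1
        call MPI_Isend(a(n), 1, MPI_REAL, rank+1, 0, MPI_COMM_WORLD, req(nreq))
      end if

      call MPI_Waitall(nreq, req(1:nreq), MPI_STATUSES_IGNORE)
      call MPI_Finalize()
    end program mpi_hx

Both variants communicate only between neighbouring pairs, which is why the abstract finds the scaling gap between them surprising.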