Conference proceedings: Recent Advances in the Message Passing Interface

Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters Using Shared Memory Backed Windows



Abstract

The Message Passing Interface (MPI) has been very popular for programming parallel scientific applications. As multi-core architectures have become prevalent, a major question has emerged about the use of MPI within a compute node and its impact on communication costs. The one-sided communication interface in MPI provides a mechanism to reduce communication costs by removing the matching requirements of the send/receive model. The MPI standard provides the flexibility to allocate memory windows backed by shared memory. However, state-of-the-art open-source MPI libraries do not leverage this optimization opportunity on commodity clusters. In this paper, we present the design and implementation of an intra-node MPI one-sided interface that uses shared-memory-backed windows on multi-core clusters. We use the MVAPICH2 MPI library for design, implementation, and evaluation. Micro-benchmark evaluation shows that, in passive synchronization mode, the new design improves Put, Get, and Accumulate latencies by up to 85%. The bandwidth of Put and Get improves by 64% and 42%, respectively. The SPLASH LU benchmark shows an improvement of up to 55% with the new design on a 32-core Magny-Cours node, and a similar improvement on a 12-core Westmere node. The mean BFS time in Graph500 is reduced by 39% and 77% on the Magny-Cours and Westmere nodes, respectively.
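The core idea can be illustrated with a small conceptual sketch. When a window is backed by shared memory, an intra-node Put, Get, or Accumulate can become a direct copy into or out of the target's window memory. Under passive synchronization, the target process does not participate at all.

The Python below is a stand-in and is NOT MPI or MVAPICH2 code. The window is an anonymous shared `mmap`, and a `multiprocessing` lock plays the role of a passive-target lock epoch. The names `SharedWindow`, `origin`, and `run_demo` are hypothetical. In real MPI-2 code, this roughly corresponds to `MPI_Alloc_mem`/`MPI_Win_create` followed by `MPI_Win_lock`, `MPI_Put`/`MPI_Get`/`MPI_Accumulate`, and `MPI_Win_unlock`. The sketch assumes a POSIX system, since it relies on the `fork` start method.

```python
"""Conceptual sketch (not MPI): one-sided Put/Get/Accumulate on a
shared-memory-backed "window" with passive-target, lock-based sync."""
import mmap
import multiprocessing as mp
import struct

DOUBLE = struct.calcsize("d")  # 8 bytes per float64 element
ctx = mp.get_context("fork")   # fork so children inherit the shared mapping


class SharedWindow:
    """Hypothetical window: shared anonymous memory plus one window-wide lock."""

    def __init__(self, count):
        self.count = count
        # Anonymous mapping; MAP_SHARED is the default on Unix,
        # so forked children see the same pages as the parent.
        self.buf = mmap.mmap(-1, count * DOUBLE)
        self.lock = ctx.Lock()  # stands in for an exclusive lock epoch

    def put(self, offset, values):
        # Origin stores directly into target memory; no matching receive.
        with self.lock:
            struct.pack_into(f"{len(values)}d", self.buf, offset * DOUBLE, *values)

    def get(self, offset, count):
        # Origin loads directly from target memory.
        with self.lock:
            return list(struct.unpack_from(f"{count}d", self.buf, offset * DOUBLE))

    def accumulate(self, offset, values):
        # SUM-style accumulate: read-modify-write while holding the lock,
        # so concurrent accumulates from other origins cannot interleave.
        n = len(values)
        with self.lock:
            old = struct.unpack_from(f"{n}d", self.buf, offset * DOUBLE)
            new = [a + b for a, b in zip(old, values)]
            struct.pack_into(f"{n}d", self.buf, offset * DOUBLE, *new)


def origin(win, rank):
    # Each origin writes its own slot and adds 1.0 to a shared counter at index 0.
    win.put(rank, [float(rank * 10)])
    win.accumulate(0, [1.0])


def run_demo(n_origins=3):
    win = SharedWindow(n_origins + 1)  # mmap memory starts zero-filled
    procs = [ctx.Process(target=origin, args=(win, r)) for r in range(1, n_origins + 1)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # the target only waits; it never posts a receive
    return win.get(0, n_origins + 1)


if __name__ == "__main__":
    print(run_demo())  # [3.0, 10.0, 20.0, 30.0]
```

A single window-wide lock keeps the sketch simple. MPI's passive-target mode also offers shared locks (`MPI_LOCK_SHARED`), which would let origins update disjoint locations concurrently.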
