IEEE International Conference on Cluster Computing

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation


Abstract

Many scientific simulations using the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design for implementing reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of the hierarchy. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures, including InfiniBand clusters and the Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as the Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, the 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%.
The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
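The multi-level scheme described in the abstract can be illustrated with a minimal, hypothetical sketch: values are first reduced within each "node" (a shared-memory subgroup) to a node leader, then across node leaders, and finally the result is propagated back to every process. This is only a sequential illustration of the hierarchical idea under assumed grouping, not the Cheetah implementation or its MPI API.

```python
# Hypothetical sketch of a two-level hierarchical allreduce.
# This simulates the data flow sequentially; a real implementation
# (e.g., over MPI) would perform each level with parallel communication.

def hierarchical_allreduce(values, procs_per_node, op=lambda a, b: a + b):
    """Combine one value per process with `op`; every process gets the result."""
    # Level 1: intra-node reduction to each node's leader
    # (the first rank of each subgroup).
    nodes = [values[i:i + procs_per_node]
             for i in range(0, len(values), procs_per_node)]
    leader_partials = []
    for node in nodes:
        acc = node[0]
        for v in node[1:]:
            acc = op(acc, v)
        leader_partials.append(acc)

    # Level 2: inter-node reduction among the node leaders.
    total = leader_partials[0]
    for v in leader_partials[1:]:
        total = op(total, v)

    # Level 3: broadcast the result back down the hierarchy.
    return [total] * len(values)
```

For example, reducing the values 0..7 over two simulated 4-process nodes yields the global sum 28 at every process; because each level is independent, tailoring the intra-node step to shared memory and the inter-node step to the network is what the hierarchical design exploits.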
