IEEE International Conference on Cluster Computing

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation


Abstract

Many scientific simulations using the Message Passing Interface (MPI) programming model are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions for performing mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design for implementing reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for the various communication mechanisms in the system, 2) providing the ability to configure the depth of the hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of the hierarchy. Using this design, we implement the MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate them on multiple architectures, including InfiniBand clusters and the Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform production-grade MPI implementations such as the Open MPI default, Cray MPI, and MVAPICH2, demonstrating their efficiency, flexibility, and portability. On InfiniBand systems, with a microbenchmark, the 512-process Cheetah nonblocking Allreduce and Reduce achieve speedups of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms Cray MPI by 145%.
The evaluation with an application kernel, a Conjugate Gradient solver, shows that the Cheetah reductions speed up the total time to solution by 195%, demonstrating the potential benefits for scientific simulations.
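The multi-level scheme described in the abstract can be illustrated with a minimal, hypothetical sketch: values are first reduced within each "node" (a shared-memory subgroup) to a node leader, then across node leaders, and finally the result is propagated back to every process. This is only a sequential illustration of the hierarchical idea under assumed grouping, not the Cheetah implementation or its MPI API.

```python
# Hypothetical sketch of a two-level hierarchical allreduce.
# This simulates the data flow sequentially; a real implementation
# (e.g., over MPI) would perform each level with parallel communication.

def hierarchical_allreduce(values, procs_per_node, op=lambda a, b: a + b):
    """Combine one value per process with `op`; every process gets the result."""
    # Level 1: intra-node reduction to each node's leader
    # (the first rank of each subgroup).
    nodes = [values[i:i + procs_per_node]
             for i in range(0, len(values), procs_per_node)]
    leader_partials = []
    for node in nodes:
        acc = node[0]
        for v in node[1:]:
            acc = op(acc, v)
        leader_partials.append(acc)

    # Level 2: inter-node reduction among the node leaders.
    total = leader_partials[0]
    for v in leader_partials[1:]:
        total = op(total, v)

    # Level 3: broadcast the result back down the hierarchy.
    return [total] * len(values)
```

For example, reducing the values 0..7 over two simulated 4-process nodes yields the global sum 28 at every process; because each level is independent, tailoring the intra-node step to shared memory and the inter-node step to the network is what the hierarchical design exploits.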
