Journal: Parallel Computing

Efficient design for MPI asynchronous progress without dedicated resources

Abstract

The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores, or application modification (e.g. use of MPI_Test). These techniques suffer from various issues, such as increased code complexity and cost, and the loss of compute resources that would otherwise be available to end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without needing any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed, and it minimizes the number of context switches and preemptions between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Open MPI. We demonstrate the benefits of the proposed approach at the microbenchmark level and at the application level at scale on five different architectures, including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed approach shows up to 60%, 36%, and 50% improvements for All-to-one, One-to-all, and All-to-all communication patterns, respectively, at 816 processes. Our design shows a 41% performance improvement for SPEC MPI compute-intensive applications, a 63% performance improvement for the P3DFFT application, and 28% higher throughput for HPL. Published by Elsevier B.V.
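
The abstract does not give implementation details, so the following is only a toy sketch of the general idea it describes: a helper thread that sleeps until the library itself detects that communication progress is needed, rather than polling continuously or occupying a dedicated core. It uses plain Python threads instead of MPI, and every name in it (`ProgressEngine`, `post`, `wait_all`, `run_demo`) is hypothetical and not taken from the paper or from any MPI library.

```python
import threading
from collections import deque


class ProgressEngine:
    """Toy stand-in for a library progress engine (hypothetical, not an MPI API)."""

    def __init__(self):
        self._pending = deque()            # operations posted but not yet progressed
        self._cond = threading.Condition()
        self._shutdown = False
        self.completed = []                # operations the helper has finished
        self.wakeups = 0                   # how often the helper found work after waking
        self._thread = threading.Thread(target=self._progress_loop, daemon=True)
        self._thread.start()

    def post(self, request_id):
        """Main thread: start a 'non-blocking' operation and return immediately."""
        with self._cond:
            self._pending.append(request_id)
            self._cond.notify()            # wake the helper only because work now exists

    def _progress_loop(self):
        while True:
            with self._cond:
                # Block (no busy polling) until there is work or a shutdown request.
                while not self._pending and not self._shutdown:
                    self._cond.wait()
                if not self._pending:
                    return                 # shutdown requested and nothing left to do
                self.wakeups += 1
                # Drain everything in one wakeup to limit context switches.
                self.completed.extend(self._pending)
                self._pending.clear()
                self._cond.notify_all()    # release any thread blocked in wait_all

    def wait_all(self, count):
        """Main thread: block until `count` operations have completed."""
        with self._cond:
            self._cond.wait_for(lambda: len(self.completed) >= count)

    def close(self):
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()
        self._thread.join()


def run_demo(n=5):
    engine = ProgressEngine()
    for i in range(n):
        engine.post(i)                     # post communication...
    _ = sum(k * k for k in range(10_000))  # ...then compute while the helper progresses it
    engine.wait_all(n)
    engine.close()
    return engine.completed, engine.wakeups


if __name__ == "__main__":
    print(run_demo())
```

The number of wakeups varies from run to run with thread scheduling, but it never exceeds the number of posted operations, because the helper drains all pending work each time it wakes.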
