Venue: IEEE International Conference on High Performance Computing

Kernel-Assisted Communication Engine for MPI on Emerging Manycore Processors



Abstract

Manycore processors such as the Intel Knights Landing (KNL), Intel's second-generation Xeon Phi, come equipped with up to 288 hardware threads and 16 gigabytes of high-bandwidth on-chip Multi-Channel DRAM (MCDRAM), which have the potential to significantly improve the performance of both compute-bound and memory-bound applications. For this potential to be realized, it is imperative to exploit KNL's highly threaded environment and to use the limited MCDRAM resource carefully. In this work, we focus on achieving effective utilization of KNL's resources through the design of a kernel-based communication engine that uses multiple kernel threads and a generic work-request abstraction scheme to accelerate MPI data-movement operations. Being a kernel-based approach, our designs are agnostic to the application's communication pattern and aim to contend minimally with the application's compute and memory requirements. We compare our proposed designs with other prevalent schemes employed by modern MPI libraries. The experimental evaluation shows that the proposed designs provide up to a 2.5X improvement at the microbenchmark level and reduce the total execution time of the MPI+OpenMP version of HPCG by up to 15% compared with other approaches. Furthermore, using the CNTK deep learning framework, we demonstrate a significant improvement over existing approaches in total training time with a multilayer perceptron (MLP) model on the MNIST image recognition dataset.

